Preface


Big data is characterized by three fundamental dimensions: Volume,
Velocity, and Variety, known as the Three V’s of Big Data. Volume
expresses the amount of data, Velocity describes the speed at which data
is arriving and being processed, and Variety refers to the number of
types of data.

The data could come from anywhere, including social media, various
sensors, financial transactions, etc. IBM has stated¹ that people create
2.5 quintillion bytes of data every day. This number is growing
constantly, and most of the data cannot be stored and is usually wasted
without being processed. Today, it is not uncommon to process terabyte-
or petabyte-sized corpora and gigabit-rate streams.

On the other hand, nowadays every company wants to fully
understand the data it has, in order to find value and act on it. This has led
to rapid growth in the Big Data software market. However,
traditional technologies, including data structures and
algorithms, become ineffective when dealing with Big Data. Therefore,
many software practitioners turn, again and again, to computer science
for the most appropriate solutions, and one option is to use probabilistic
data structures and algorithms.

¹ What Is Big Data? https://www.ibm.com/software/data/bigdata/what-is-big-data.html

Probabilistic data structures is a common name for data structures 
based mostly on different hashing techniques. Unlike regular (or 
deterministic) data structures, they always provide approximated 
answers but with reliable ways to estimate possible errors. Fortunately, 
the potential losses and errors are fully compensated for by extremely
low memory requirements, constant query time, and scaling, the factors
that become essential in Big Data applications.


About this book 

The purpose of this book is to introduce technology practitioners,
including software architects and developers, as well as technology
decision makers, to probabilistic data structures and algorithms. Reading
this book, you will get a theoretical and practical understanding of
probabilistic data structures and learn about their common uses.

This is not a book for scientists, but to gain the most out of it you 
will need to have basic mathematical knowledge and an understanding 
of the general theory of data structures and algorithms. If you do not 
have any “computer science” experience, it is highly recommended that you
read Introduction to Algorithms by Thomas H. Cormen, Charles E. 
Leiserson, Ronald L. Rivest, and Clifford Stein (MIT), which provides 
a comprehensive introduction to the modern study of computer 
algorithms. 

While it is impossible to cover all the existing amazing solutions,
this book highlights their common ideas and important areas of
application, including membership querying, counting, stream mining,
and similarity estimation.





Organization of the book 

This book consists of six chapters, each preceded by an introduction 
and followed by a brief summary and bibliography for further reading 
relating to that chapter. Every chapter is dedicated to one particular
problem in Big Data applications: it starts with an in-depth explanation
of the problem and continues by introducing data structures and algorithms
that can be used to solve it efficiently.

The first chapter gives a brief overview of popular hash functions 
and hash tables that are widely used in probabilistic data structures. 
Chapter 2 is devoted to approximate membership queries, the most
well-known use case of such structures. Chapter 3 discusses data structures that
help to estimate the number of unique elements. Chapters
4 and 5 are dedicated to important frequency- and rank-related metrics
computations in streaming applications. Chapter 6 consists of data
structures and algorithms to solve similarity problems, particularly
the nearest neighbor search.


This book on the Web 

You can find errata, examples, and additional information at
https://pdsa.gakhov.com. If you have a comment, a technical question
about the book, would like to report an error you found, or have any other
issue, send an email to pdsa@gakhov.com.

In case you are also interested in a Cython implementation that includes
many of the data structures and algorithms from this book, please
check out our free and open-source Python library called PDSA at
https://github.com/gakhov/pdsa. Everybody is welcome to contribute
at any time.


1

Hashing 


Hashing plays the central role in probabilistic data structures as they
use it for randomization and compact representation of the data.
A hash function compresses blocks of input data of an arbitrary size by
generating an identifier of a smaller (and in most cases fixed) size, called
the hash value or simply the hash.

The choice of hash functions is crucial to avoid bias. Although 
the selection decision is mostly based on the input and particular 
use cases, there are certain common properties that a hash function 
should fulfill in order to be applicable for hash-based selection. 

Hash functions compress the input; therefore, cases where they generate
the same hash values for two different blocks of data are unavoidable and
are known as hash collisions.

In 1979 J. Lawrence Carter and Mark Wegman proposed the universal 
hash functions whose mathematical properties can guarantee a low 
expected number of collisions, even if the input data are chosen 
randomly from the universe. 

The universal hash functions family H maps elements of the universe
to the range {0, 1, ..., m - 1} and guarantees that by randomly picking
a hash function from the family the probability of collisions is limited:

$$\Pr\big(h(x) = h(y)\big) \leq \frac{1}{m}, \quad \text{for any } x, y : x \neq y. \qquad (1.1)$$

Thus, the random choice of a hash function from the family with
property (1.1) is precisely the same as choosing an element uniformly
at random.

An important universal hash functions family, designed to hash integers,
can be defined as

$$h_{\{k,q\}}(x) = \big((k \cdot x + q) \bmod p\big) \bmod m, \qquad (1.2)$$

where k and q are randomly chosen integers modulo p with k ≠ 0.
The value of p should be selected as a prime p > m, and the common
choice is to take one of the known Mersenne prime numbers, e.g., for
m = 10^9 we choose p = M_31 = 2^31 - 1 ≈ 2 · 10^9.
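
The family (1.2) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the names `make_universal_hash` and `collision_rate` are hypothetical helpers, not part of any library, and the empirical check merely suggests (it does not prove) that the collision probability for a fixed pair of distinct inputs stays near the 1/m bound from (1.1).

```python
import random

P = 2**31 - 1  # Mersenne prime M31, the example choice of p in the text


def make_universal_hash(m, p=P, rng=random):
    """Draw one function h_{k,q} from the family (1.2) at random."""
    k = rng.randrange(1, p)  # k is chosen modulo p with k != 0
    q = rng.randrange(0, p)  # q is chosen modulo p

    def h(x):
        return ((k * x + q) % p) % m

    return h


def collision_rate(x, y, m, trials=5000, seed=1):
    """Fraction of randomly drawn functions for which h(x) == h(y)."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        h = make_universal_hash(m, rng=rng)
        hits += h(x) == h(y)
    return hits / trials


if __name__ == "__main__":
    h = make_universal_hash(m=10**9)
    print(h(42), h(43))
    # For m = 10 the observed rate should be in the vicinity of 1/10.
    print(collision_rate(3, 17, m=10))
```

Drawing a fresh pair (k, q) for every trial is what the definition refers to: the guarantee is over the random choice of the function, not over the inputs.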

Many applications can use the simpler version of the family (1.2):

$$h_{\{k\}}(x) = \big((k \cdot x) \bmod p\big) \bmod m, \qquad (1.3)$$

which is only approximately universal, but still provides a good probability
of collisions, smaller than 2/m in expectation.

However, the above families of hash functions are limited to integers,
which is not enough for most practical applications, which need to
hash variable-sized vectors and demand fast and reliable hash
functions with certain guaranteed properties.

There are many classes of hash functions used in practice and the choice 
mainly depends on their design and particular use. In the current 
chapter we provide an overview of popular hash functions and simple 
data structures that are prevalent in various probabilistic data structures. 


1.1 Cryptographic hash functions 

Practically, cryptographic hash functions are defined as fixed mappings
from variable input bit strings to fixed length output bit strings.

As stated previously, hash collisions are unavoidable, but a secure hash
function is required to be collision resistant, meaning that it should be
hard to find collisions. Of course, a collision can be found accidentally
or computed in advance. This is why such a class of functions always
requires mathematical proofs.

Cryptographic hash functions are very important in cryptography and 
are used in many applications such as digital signatures, authentication
schemes, and message integrity.

There are three main requirements that cryptographic hashes are 
expected to satisfy: 

• Work factor: to make brute force inversion hard, a cryptographic
hash should be computationally expensive.

• Sticky state: a cryptographic hash should not have a state in
which it can stick for a plausible input pattern.

• Diffusion: every output bit of a cryptographic hash should be
an equally complex function of every input bit.

Theoretically, cryptographic functions can be further divided into
keyed hash functions, which use a secret key, and unkeyed hash functions,
which do not. Probabilistic data structures use only unkeyed hash
functions, which include One-Way hash functions, Collision Resistant
hash functions, and Universal One-Way hash functions. These functions
differ only in some additional properties.

One-Way hash functions satisfy the following requirements: 

• They can be applied to blocks of data of any length (of course, 
in practice, it’s bounded by some huge constant). 

• They produce a fixed-length output. 

• They should have preimage resistance (the one-way property): it
should be computationally infeasible to find an input which hashes
to the specified output.

Additionally, for Collision Resistant hash functions it should be 
extremely unlikely for two different inputs to generate the same hash 
value. 

If not collision resistant, Universal One-Way hash functions need to
be target collision resistant, also called second-preimage resistant: it
should be computationally infeasible to find a second distinct input that
hashes to the same output as the specified input.

Note that being collision resistant implies that the function is
second-preimage resistant, but the generic complexity of finding
a second preimage is much higher than that of finding
a colliding pair.


Because of their design (particularly, the work factor requirement),
cryptographic hash functions are much slower than non-cryptographic
ones. For instance, the throughput of the SHA-1 function, discussed below,
is on the order of 540 MiB/second¹, while popular non-cryptographic
functions reach the order of 2500 MiB/second and more.


¹ Crypto++ 6.0.0 Benchmarks https://www.cryptopp.com/benchmarks.html


Message-Digest Algorithms

The popular Message Digest Algorithm, MD5, was invented by Ron
Rivest in 1991 to replace the old MD4 standard. It is a cryptographic
hash algorithm, defined in IETF RFC 1321, that takes a message of
an arbitrary length and produces as an output a 128-bit hash
of the input.

The MD5 algorithm is based on the Merkle-Damgård scheme. At
the first stage, it converts the input of an arbitrary size to a number of
blocks of a fixed size (512-bit blocks or sixteen 32-bit words) using
an MD-compliant padding function. Afterwards, the blocks are processed one
by one using a special compression function, and the processing of each
block uses the output of the previous one. To make the compression secure,
the algorithm applies Merkle-Damgård strengthening, in which the padding
includes the encoded length of the original message. The final MD5 hash
digest is the 128-bit value generated after the processing of the last block.

The MD5 algorithm is often used to verify the integrity of a file:
instead of confirming that the file is unchanged by examining its raw
data, it is enough to compare the MD5 hashes.

As stated in Vulnerability Note VU#836068², the MD5 algorithm is
vulnerable to collision attacks. The discovered weaknesses in the algorithm
allow for the construction of different messages with the same MD5 hash.
As a result, attackers can generate cryptographic tokens or other data that
illegitimately appear authentic. It is no longer advisable to use MD5 as a secure
cryptographic algorithm; however, this vulnerability does not
have a big impact on probabilistic data structures, where MD5 can still be used.

² VU#836068 http://www.kb.cert.org/vuls/id/836068


Secure Hash Algorithms

Secure Hash Algorithms were developed by the US National Security
Agency (NSA) and published by the National Institute of Standards and
Technology (NIST). The first algorithm from the family, called SHA-0,
was published in 1993 and quickly replaced by its successor SHA-1,
which became widely accepted globally. SHA-1 produces a longer 160-bit
(20-byte) hash value, while its security has been increased by fixing
the weaknesses of SHA-0.

SHA-1 was widely used for years in various applications, and most
websites were signed using algorithms based on it. However, in 2005
a weakness in SHA-1 was discovered, so in 2010 NIST deprecated it for
government use, and it has been deprecated on the Internet since 2011.
As with MD5, the discovered weaknesses do not affect its usage as
a hash function for probabilistic data structures.

SHA-2 was published in 2001 and included six hash functions with 
varying digest sizes: SHA-224, SHA-256, SHA-384, SHA-512, and 
others. SHA-2 is stronger than SHA-1, and practical attacks against SHA-2
are considered unlikely with current computing power.
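
The digest sizes mentioned above can be checked with Python's standard `hashlib` module. The sketch below is illustrative only: `digest_bits` and `bucket` are hypothetical helper names, and reducing a digest modulo m is just one simple way (an assumption, not a method prescribed by the text) to turn a cryptographic hash into a bucket index for a data structure.

```python
import hashlib


def digest_bits(name, data=b"probabilistic"):
    """Return the digest length in bits for a hashlib algorithm name."""
    return hashlib.new(name, data).digest_size * 8


def bucket(data, m, name="sha256"):
    """Map a byte string to an index in {0, ..., m - 1} via a digest."""
    digest = hashlib.new(name, data).digest()
    return int.from_bytes(digest, "big") % m


if __name__ == "__main__":
    for name in ("md5", "sha1", "sha224", "sha256", "sha384", "sha512"):
        print(name, digest_bits(name))
    print(bucket(b"red", 12))
```

Some hardened Python builds restrict MD5; in that case the `"md5"` entry can simply be dropped from the loop.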

RadioGatun 

The cryptographic hash function family called RadioGatun was presented 
at the Second Cryptographic Hash Workshop in 2006 [Be06]. The design 
of RadioGatun improved the known Panama hash function. 

Similar to other popular hash functions, the input is split into
a sequence of blocks which are injected into the algorithm’s internal
state using a special function. This is followed by an iterative application
of a single non-cryptographic round function (called the belt-and-mill
round function). At every round, the state is represented as two parts,
the belt and the mill, that are treated differently by the round function.
The application of the round function consists of four operations in
parallel: 1) a non-linear function applied to the mill, 2) a simple linear
function applied to the belt, 3) feeding forward some bits of the mill to
the belt in a linear way, 4) feeding forward some bits of the belt to the mill
in a linear way. After injection of all input blocks, the algorithm
performs a number of rounds without input or output (blank rounds),
after which a part of the state is returned as the final hash value.
Among the family, RadioGatun64, with 64-bit words, is the default 
choice and is optimal for 64-bit platforms. For best performance on 
32-bit platforms, RadioGatun32, with 32-bit words, can also be used. 


For the same clock frequency, RadioGatun32 is claimed to be 12 times faster
than SHA-256 for long inputs, and 3.2 times faster for short inputs, while
having fewer gates. RadioGatun64 is even 24 times faster than SHA-256
for long inputs but has about 50% more logic gates.


1.2 Non-Cryptographic hash functions 

In contrast to cryptographic hash functions, non-cryptographic functions 
are not designed to fend off attacks aimed at finding a collision, hence 
don’t require security and high collision resistance. 

Such functions simply have to be fast and guarantee a low probability
of collisions, allowing a lot of data to be quickly hashed with a reasonable
error probability.

Fowler/Noll/Vo 

The basis of the Fowler/Noll/Vo (FNV or FNV1) non-cryptographic
hash algorithm was taken from an idea sent, as a reviewer comment, to
the IEEE POSIX P1003.2 committee by Glenn Fowler and Phong Vo
back in 1991 and afterward improved on by Landon Curt Noll [Fo18].

The FNV algorithm maintains an internal state that is initialized
to a special offset basis. After that, it iterates over the input in blocks
of 8 bits: it multiplies the state by a large
numerical constant, called the FNV Prime, and then applies a logical
exclusive OR (XOR) with the input block. After the last input block is processed,
the resulting value of the state is reported as the hash.

The FNV Prime and the offset basis constants are design parameters
and depend on the bit length of the produced hash values. As mentioned
by Landon Curt Noll, the selection of the primes is part of the magic
of the FNV algorithm, and some primes do hash better than others for
the same hash size.

The alternate FNV1a algorithm, which is currently preferred, is
a minor variation of the FNV algorithm that differs only in the order of
the internal XOR and multiplication operations. Although FNV1a uses
the same parameters and the FNV Prime as FNV1, its XOR-folding
provides slightly better dispersion without affecting CPU
performance.
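
The difference between the two variants is easiest to see side by side. The following is a minimal sketch of the 32-bit versions; the offset basis and prime values are the publicly documented 32-bit FNV parameters, which are not given in the text above and are therefore stated here as an assumption, and the function names are chosen only for illustration.

```python
FNV32_OFFSET_BASIS = 0x811C9DC5  # 2166136261, published 32-bit offset basis
FNV32_PRIME = 0x01000193         # 16777619, published 32-bit FNV Prime
MASK32 = 0xFFFFFFFF              # keep the state within 32 bits


def fnv1_32(data: bytes) -> int:
    """FNV1: multiply the state by the prime, then XOR the input byte."""
    h = FNV32_OFFSET_BASIS
    for byte in data:
        h = (h * FNV32_PRIME) & MASK32
        h ^= byte
    return h


def fnv1a_32(data: bytes) -> int:
    """FNV1a: XOR the input byte first, then multiply by the prime."""
    h = FNV32_OFFSET_BASIS
    for byte in data:
        h ^= byte
        h = (h * FNV32_PRIME) & MASK32
    return h


if __name__ == "__main__":
    for word in (b"", b"a", b"red"):
        print(word, hex(fnv1_32(word)), hex(fnv1a_32(word)))
```

For an empty input both functions return the offset basis unchanged, since no block is ever mixed into the state.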

Currently, the FNV family includes algorithms for 32-, 64-, 128-, 256-, 
512-, and 1024-bit hash values. 

The FNV is very simple to implement, and the high dispersion of
its hash values makes it well suited for hashing nearly identical
strings. It is widely used in DNS servers, Twitter, database indexing
hashes, web search engines, and many other places. Some years ago,
the 32-bit version of FNV1a was recommended as the hash algorithm
for IPv6 flow label generation [An12].


MurmurHash

Another well-known family of hash functions, called MurmurHash, was
published by Austin Appleby in 2008 and finalized as the MurmurHash3
algorithm in 2011 [Ap11].

The MurmurHash algorithms use a special probabilistic technique for
approximating the global optimum to find a hash function that mixes
the bits of the input value in the best way to produce the bits of the output
hash. The various generations of the algorithm differ mainly in their
mixing functions.

The algorithm is claimed to be twice as fast as the speed-optimized
lookup3 hash function³. MurmurHash3 includes 32- and 128-bit versions
for x86 and x64 platforms.

Currently, MurmurHash3 is one of the most popular algorithms and is
used in Apache Hadoop, Apache Cassandra, Elasticsearch, libstdc++,
nginx, and others.

CityHash and FarmHash

In 2011, Google published a new family of hash functions for strings,
called CityHash, developed by Geoff Pike and Jyrki Alakuijala [Pi11].
CityHash functions are simple non-cryptographic hash functions that are
based on the MurmurHash2 algorithm.

The CityHash family was developed with a focus on short strings
(e.g., up to 64 bytes), which are of most interest in probabilistic data
structures and hash tables. It includes 32-, 64-, 128- and 256-bit versions.
For such short strings, the 64-bit version CityHash64 is faster than
MurmurHash and outperforms the 128-bit CityHash128. For long strings
of at least a few hundred bytes, CityHash128 is
preferred over the other hash functions of the CityHash family; in practice,
however, it is better to use MurmurHash3 for such inputs.

One of the downsides of the CityHash is that it is fairly complex and 
leads to non-optimal behavior on different compilers that can significantly 
degrade its speed. 

In 2014 Google published a successor to CityHash called
FarmHash, developed by Geoff Pike [Pi14]. The new algorithm
included most of the techniques used in CityHash (and, unfortunately,
inherited its complexity) and in the new generation of MurmurHash.
FarmHash functions mix the input bits thoroughly, but not thoroughly
enough to be used in cryptography.

FarmHash uses CPU-specific optimizations, still requires tuning
of the compiler to get the best performance, and is platform dependent.
Notably, the computed hash values also differ across platforms.

The FarmHash functions come in many versions, and the 64-bit version
Farm64 outperforms algorithms such as CityHash, MurmurHash3, and
FNV in tests on many platforms, including mobile phones.

³ Hash Functions and Block Ciphers https://burtleburtle.net/bob/hash/


1.3 Hash tables

A hash table is a dictionary data structure that comprises an unordered
associative array of length m whose entries are called buckets and are
indexed by a key in the range {0, 1, ..., m - 1}. To insert an element
into the hash table, a hash function is used to compute the key that is
utilized to select the appropriate bucket to store the value.

Typically, the universe from which we draw the input elements is much 
bigger than the capacity m of the hash table, hence collisions in keys 
are unavoidable. Additionally, when the number of elements in the hash 
table grows, the number of collisions rises as well. 

The critical concept of hash tables is the load factor α, the ratio of
the number of used keys n to the table’s total length m:

$$\alpha := \frac{n}{m}.$$

The load factor is a measure of how full the hash table is and, since n
cannot exceed the capacity of the hash table, it is upper bounded by
one. When α approaches its maximal value, the probability of collision
increases significantly, which can necessitate an increase in capacity.
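
As a small illustration of how the load factor is typically monitored, consider the sketch below. The function names and the 0.75 threshold are assumptions made for this example only; the text does not prescribe a particular threshold.

```python
def load_factor(n: int, m: int) -> float:
    """Ratio of the number of used keys n to the table length m."""
    if m <= 0:
        raise ValueError("table length m must be positive")
    return n / m


def needs_resize(n: int, m: int, max_load: float = 0.75) -> bool:
    """True when the table is fuller than the chosen threshold.

    The default of 0.75 is an illustrative assumption, not a value
    taken from the text.
    """
    return load_factor(n, m) > max_load


if __name__ == "__main__":
    print(load_factor(3, 12))    # a quarter of the buckets are used
    print(needs_resize(10, 12))  # 10/12 exceeds the assumed threshold
```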

All hash table implementations need to address the problem of collisions 
and provide a strategy on how to handle them. There are two main 
techniques: 

• Closed addressing: to store collided elements under the same
keys in a secondary data structure.

• Open addressing: to store collided elements in positions other
than their preferred positions and provide a way to address them.

The closed addressing technique is the most obvious way to resolve
collisions. There are many different implementations, for instance,
separate chaining, which stores collided elements in a linked list, and perfect
hashing, which uses special hash functions and secondary hash tables of
different lengths.

Instead of creating a secondary data structure in either form, it is 
possible to resolve collisions by storing the collided elements elsewhere in 
the primary table and providing an algorithm on how to address them. 
Since the address of the element is not known from the beginning, this
technique is known as open addressing.

Now we will cover two open addressing implementations that are useful 
in the probabilistic data structures listed in this book. 


Linear probing 

One of the most straightforward hash table implementations that uses
open addressing is the Linear probing algorithm, invented by Gene
Amdahl, Elaine M. McGraw, and Arthur Samuel in 1954 and analyzed
by Donald Knuth in 1963. The idea of the algorithm is to place collided
elements into the next empty bucket. Its name originates from the fact
that the final position of the element will be linearly shifted from
the preferable bucket since we probe one bucket after another.

A Linear Probing hash table can be seen as a circular array that
stores indexed values in buckets. To insert a new element x, we compute
its key k = h(x) using a single hash function h. If the bucket that
corresponds to that key is non-empty and contains a different value,
meaning a collision, we keep looking clockwise at the next buckets until
we find a free space where we can index the element x. Monitoring
the load factor of the hash table can guarantee that we will definitely
find a free space at some point.

Similarly, when we want to look up some element x, we compute
its key k using the same hash function h and start checking the buckets
clockwise, starting at the preferable bucket with the key k = h(x),
until we find the wanted element x or the first empty bucket appears,
resulting in the decision that the element is not in the table.


Example 1.1: Linear probing 

Consider a LinearProbing hash table of length m = 12 and a hash
function based on 32-bit MurmurHash3 that maps the universe to the range
{0, 1, ..., m - 1}:

$$h(x) := \mathrm{MurmurHash3}(x) \bmod m.$$

Suppose that we want to store different names of colors in the hash table,
starting from red. The value of the hash function for the element is

$$h = h(\mathrm{red}) = 2352586584 \bmod 12 = 0.$$

Since the LinearProbing hash table is empty at the beginning, the bucket
with the key k = 0 contains no elements, therefore we just index the element
there.



Next, we take the element green, whose hash value is

$$h = h(\mathrm{green}) = 150831125 \bmod 12 = 5.$$

The key is k = 5, and as this bucket is empty we again freely store the element.

Now, consider the element white. Its hash value is

$$h = h(\mathrm{white}) = 16728905 \bmod 12 = 5.$$

The preferable bucket for that element is the one with the key k = 5.
However, the bucket is already occupied by a different element, meaning
a collision has appeared. In this case, we apply the Linear probing
algorithm and try to find the next empty bucket going clockwise from
the preferable bucket position. Fortunately, the next bucket, under key
k = 6, is free and we store the element white there.



When we look up the element white in the LinearProbing hash table,
we first check its preferable bucket, with the key k = 5. Since that bucket
contains a value that differs from the element, we start checking buckets
in a clockwise direction, starting from the key k + 1 = 6. Fortunately,
the next bucket with the key k = 6 contains the wanted value and we can
conclude that the element is present in the hash table.
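
The steps of this example can be replayed with a minimal Python sketch. To stay self-contained it does not compute MurmurHash3 itself; instead, the dictionary `EXAMPLE_HASHES` simply copies the three hash values quoted in the example. The class name `LinearProbingTable` and its methods are hypothetical and chosen only for illustration.

```python
# Hash values copied from the example (32-bit MurmurHash3 of each color).
EXAMPLE_HASHES = {"red": 2352586584, "green": 150831125, "white": 16728905}


class LinearProbingTable:
    def __init__(self, m, hash_fn):
        self.m = m
        self.hash_fn = hash_fn
        self.buckets = [None] * m

    def insert(self, x):
        """Store x and return the key of the bucket that was used."""
        k = self.hash_fn(x) % self.m
        for step in range(self.m):
            i = (k + step) % self.m  # walk "clockwise" around the array
            if self.buckets[i] is None or self.buckets[i] == x:
                self.buckets[i] = x
                return i
        raise RuntimeError("hash table is full")

    def lookup(self, x):
        """Return the key of the bucket holding x, or None if absent."""
        k = self.hash_fn(x) % self.m
        for step in range(self.m):
            i = (k + step) % self.m
            if self.buckets[i] is None:
                return None  # first empty bucket: x is not in the table
            if self.buckets[i] == x:
                return i
        return None


if __name__ == "__main__":
    table = LinearProbingTable(12, lambda x: EXAMPLE_HASHES[x])
    for color in ("red", "green", "white"):
        print(color, "->", table.insert(color))
    print("white found at", table.lookup("white"))
```

Note that this sketch has no deletion: removing an element from a linear probing table needs extra care (for instance, tombstones), because an emptied bucket would otherwise cut a probe sequence short.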


The algorithm requires O(1) time for each operation, as long as
the LinearProbing hash table is not full (the load factor is strictly
less than one). The longest probe sequence in Linear probing is of
expected length O(log n).


The Linear probing algorithm is very sensitive to the choice of the hash
function h because it must provide an ideal uniform distribution.
Unfortunately, in practice, this is not possible, and the performance of
the algorithm degrades rapidly as the actual distribution diverges. To
address this problem, a variety of techniques for additional randomization
are widely used.


Cuckoo hashing 

Another implementation of open addressing is Cuckoo hashing, introduced
by Rasmus Pagh and Flemming Friche Rodler in 2001 and published
in 2004 [Pa04]. The main idea of the algorithm is to use two hash
functions instead of one.

The Cuckoo hash table is an array of buckets, where instead of one 
preferable bucket as in Linear probing and many other algorithms, each 
element has two candidate buckets determined by two different hash 
functions. 

To index a new element x into the Cuckoo table, we compute keys
for two candidate buckets with the hash functions $h_1$ and $h_2$. If at least
one of those buckets is empty, we insert the element into that bucket.
Otherwise, we randomly choose one of those buckets and store element x
there, while moving the element from that bucket to its alternative
candidate bucket. We repeat this procedure until an empty bucket is
found, or until a maximum number of displacements is reached. If there
are no empty buckets, the hash table is considered full.

Although Cuckoo hashing may execute a sequence of displacements,
insertion still completes in expected constant time O(1).

The lookup procedure is straightforward and can be done in constant
time. We simply need to determine the candidate buckets for the input
element by computing its hashes $h_1$ and $h_2$ and check if such an element is
present in one of those buckets. The deletion procedure can be performed
in a similar way.


Example 1.2: Cuckoo hashing 

Consider a Cuckoo hash table of length m = 12 with two 32-bit
hash functions MurmurHash3 and FNV1a that produce values in
the range {0, 1, ..., m - 1}:

$$h_1(x) := \mathrm{MurmurHash3}(x) \bmod m,$$
$$h_2(x) := \mathrm{FNV1a}(x) \bmod m.$$

Like in Example 1.1, we index color names in the hash table starting
with red. The keys of the candidate buckets we obtain by applying those
hash functions:

$$h_1(\mathrm{red}) = 2352586584 \bmod 12 = 0,$$
$$h_2(\mathrm{red}) = 1089765596 \bmod 12 = 8.$$

The Cuckoo hash table is empty, so we use one of the candidate buckets,
for instance, the bucket with the key $k = h_1(\mathrm{red}) = 0$, and index
the element.

Next, we index element black whose candidate buckets are $h_1(\mathrm{black}) = 6$
and $h_2(\mathrm{black}) = 0$. Since the bucket with the key k = 0 is occupied by
another element, we can only index it into the alternative bucket k = 6,
which is free.

There is a similar situation with the element silver with $h_1(\mathrm{silver}) = 5$
and $h_2(\mathrm{silver}) = 0$. We store this element in the bucket with the key
k = 5 since 0 is occupied.

Now consider the element white. The hash values of this element are

$$h_1(\mathrm{white}) = 16728905 \bmod 12 = 5,$$
$$h_2(\mathrm{white}) = 3724674918 \bmod 12 = 6.$$

As we can see, both candidate buckets for this element are occupied, and we 
have to perform the displacements according to the Cuckoo hashing schema. 
First, randomly select one of the candidate buckets, let’s say the bucket 
with the key k = 5 and put the element white into it. The element silver 
from the bucket 5 has to be relocated to its alternative bucket, which 
is 0. As we can see, the bucket with the key 0 is not empty; therefore, 
we store element silver and move element red from that bucket to its 
other candidate bucket. Fortunately, the alternative bucket with the key 
8 for element red is free and after storing it in that bucket, we finish 
the insertion procedure. 

For instance, when we want to look up the element silver, we check only
its candidate buckets, which are 5 and 0, as we computed earlier. Since this
element is present in one of them, in the bucket with the key 0 in this case,
we conclude that the element silver is present in the Cuckoo hash table.
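
The displacement chain from this example can be reproduced with a short Python sketch. Two simplifications are assumed here: the candidate bucket keys are copied from the example into `EXAMPLE_KEYS` rather than computed with MurmurHash3 and FNV1a, and the "random" choice of the bucket to evict from is fixed to the first candidate so that the run is repeatable. The class name `CuckooTable` is hypothetical.

```python
# Candidate bucket keys (h1(x), h2(x)) copied from the example.
EXAMPLE_KEYS = {
    "red": (0, 8),
    "black": (6, 0),
    "silver": (5, 0),
    "white": (5, 6),
}


class CuckooTable:
    def __init__(self, m, keys_fn, max_kicks=16):
        self.keys_fn = keys_fn      # returns the pair (h1(x), h2(x))
        self.max_kicks = max_kicks  # limit on the number of displacements
        self.buckets = [None] * m

    def lookup(self, x):
        """Check only the two candidate buckets of x."""
        return any(self.buckets[k] == x for k in self.keys_fn(x))

    def insert(self, x):
        if self.lookup(x):
            return True
        k1, k2 = self.keys_fn(x)
        for k in (k1, k2):
            if self.buckets[k] is None:
                self.buckets[k] = x
                return True
        k = k1  # the text picks at random; fixed here for repeatability
        for _ in range(self.max_kicks):
            # Put x into bucket k and take out the element that was there.
            x, self.buckets[k] = self.buckets[k], x
            a, b = self.keys_fn(x)
            k = b if k == a else a  # alternative bucket of the evicted one
            if self.buckets[k] is None:
                self.buckets[k] = x
                return True
        # Too many displacements: the table is treated as full. In this
        # sketch the last evicted element is dropped; a real table would
        # rehash or grow instead.
        return False


if __name__ == "__main__":
    table = CuckooTable(12, lambda x: EXAMPLE_KEYS[x])
    for color in ("red", "black", "silver", "white"):
        table.insert(color)
    print(table.buckets)
```

After the four insertions the sketch ends in the same state as the example: silver in bucket 0, white in bucket 5, black in bucket 6, and red in bucket 8.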


Cuckoo hashing ensures high space occupancy but requires the length 
of the hash table to be slightly larger than the space needed to keep 
all elements. A modification of the Cuckoo hash schema is used in 
a probabilistic data structure called the Cuckoo filter, which we will 
describe in detail in the next chapter. 


Conclusion 

In this chapter we covered an overview of hashing, its problems and
importance in data structures. We discussed cryptographic versus
non-cryptographic hash functions, reviewed a list of the functions that are
most used in practice, and learned about universal hashing which is
very important theoretically. As an application of the hash functions we
have considered hash tables, which are simple data structures that map
keys to values and answer membership queries. We studied examples
of open addressing hash tables that we will use in the next chapters for
probabilistic data structures.

If you are interested in more information about the material covered 
here, please take a look at the list of references that follows this chapter. 

In the next chapter we will discuss the first probabilistic data
structures and study extensions of hash tables, called filters, that are
used to answer membership queries under requirements that are
common for Big Data applications, such as when storage is at
a premium and lookups must be as fast as possible.


2

Membership 


A membership problem for a dataset is a task to decide whether some
element belongs to the dataset or not. For small sets, it could be
solved by direct lookup and subsequent comparison of the given element
to each element in the set. However, such a naive approach depends
on the number of elements in the set and takes on average O(log n)
comparisons (on pre-sorted data), where n is the total number of elements.
It is obvious that for the huge sets of elements that Big
Data applications operate on, this approach is not efficient and requires too much
time and O(n) memory to store the elements.

Possible workarounds, like chunking such sets and running
comparisons in parallel, can help reduce the computation time.
However, they are not always applicable, because in big data processing
storing such huge sets of elements is an almost unachievable task.

On the other hand, in many cases, it isn’t necessary to know exactly 
which element from the set has been matched, only that a match has been 
made and, therefore, it is possible to store only signatures of the elements 
rather than the whole value. 


Example 2.1: Safe-browsing problem 

Imagine that we develop a web browser and know that some URLs contain
malware, so we want to alert users (or even prevent them from visiting)
if they try to navigate to those pages. An immediate solution that
minimizes the network traffic is to store all such URLs in the application
and, after the user enters a URL, check that it is not known as malware
and can be safely navigated to.

Such a naive implementation works quite well while the number of
bad URLs is small. That is unfortunately not the case for real-world
applications. After some time, we will need a special structure that can
store bad URLs (or, ideally, only some information about them) without
growing linearly in size when a new URL is introduced. It should also
support checking whether a URL is listed, and it should be as fast as
possible since we don’t want users to wait for a long time.


Applications of the membership problem are not specific to pure
computer science and play an essential role in various fields.


Example 2.2: DNA sequences (Stranneheim et al., 2010)

One important issue in metagenomic studies is the classification of 
sequences either as “novel” or belonging to a known genome, i.e., filtering 
out data that has been seen before. 

A preprocessing step that executes membership queries, if performed
efficiently, can reduce the complexity of the data before more careful
analysis is performed.


The problem of fast lookup can be solved using hashing, which is also
the simplest way to do that. With a hash function, every element of
the dataset can be hashed into a hash table that maintains a (sorted)
list of hash values. However, such an approach yields a small probability
of errors (caused by possible hash collisions) and requires about O(log n)
bits per hashed element, which can still be infeasible in practice
for huge datasets.
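
The idea of keeping only hash values in a sorted list can be sketched as
follows. This is a minimal illustration under stated assumptions: the names
signature and SignatureSet are hypothetical, and BLAKE2b from the Python
standard library is used only as a convenient stand-in for a 32-bit hash
function.

```python
import bisect
import hashlib


def signature(element: str) -> int:
    """Return a 32-bit signature of the element (stand-in hash, an assumption)."""
    digest = hashlib.blake2b(element.encode("utf-8"), digest_size=4).digest()
    return int.from_bytes(digest, "big")


class SignatureSet:
    """Keeps a sorted list of hash signatures instead of the elements themselves."""

    def __init__(self) -> None:
        self._signatures: list[int] = []

    def add(self, element: str) -> None:
        sig = signature(element)
        pos = bisect.bisect_left(self._signatures, sig)
        # Insert only if the signature is not stored yet, keeping the list sorted.
        if pos == len(self._signatures) or self._signatures[pos] != sig:
            self._signatures.insert(pos, sig)

    def may_contain(self, element: str) -> bool:
        # Binary search; a hash collision can make this return True by mistake.
        sig = signature(element)
        pos = bisect.bisect_left(self._signatures, sig)
        return pos < len(self._signatures) and self._signatures[pos] == sig
```

A lookup is a binary search over the signatures, so an element that was added
is always found, while an element that was never added is reported as present
only if its signature collides with a stored one.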

In this chapter, we consider popular alternatives to regular hash tables
that require less space, make faster lookups, and maintain smaller error
probabilities. Such space-efficient data structures help to handle a big
volume of data and allow for the execution of membership queries with
good performance.

We start with the famous Bloom filter, then learn about its extensions
and modifications, and finally, study its modern alternatives.


2.1 Bloom filter 

The simplest and most well-known data structure that solves
the membership problem is the Bloom filter, which was proposed by
Burton Howard Bloom in 1970. It is a space-efficient probabilistic data
structure for representing a dataset $D = \{x_1, x_2, \dots, x_n\}$ of
n elements that supports only two operations:

• Adding an element into the set, and

• Testing whether an element is or is not a member of the set.

The Bloom filter can store a large set very efficiently by discarding 
the identity of the elements; it stores only an (almost) unique set of 
bits corresponding to some number of hash functions that are applied to 
the element by the algorithm. 

Practically, the Bloom filter is represented by a bit array and can be
described by its length m and the number of different hash functions
$\{h_i\}_{i=1}^{k}$. It is assumed that m is proportional to the number
of expected elements n, while k is much smaller than m.

Hash functions $h_i$ should be independent and uniformly distributed.
In this way, we randomize the hash values uniformly (you can think of it
as using hash functions as some kind of random-number generator) in
the filter and decrease the probability of hash collisions.

Such an approach drastically reduces the storage space and, regardless 
of the number of elements in the data structure and their size, requires a 
constant number of bits by reserving a few bits per element. 

The BloomFilter data structure is a bit array of length m where at
the beginning all bits are equal to zero, meaning the filter is empty. To
insert an element x into the filter, for every hash function $h_i$ we
compute its value $j = h_i(x)$ on the element x and set the corresponding
bit j in the filter to one. Note that some bits can be set multiple
times due to hash collisions.

Algorithm 2.1: Adding element to the Bloom filter
Input: Element x ∈ D
Input: Bloom filter with k hash functions {h_i}, i = 1, ..., k
for i ← 1 to k do
    j ← h_i(x)
    BloomFilter[j] ← 1


Example 2.3: Add elements to the filter 

Consider a Bloom filter with length m = 10 and two 32-bit hash functions,
MurmurHash3 and FNV1a, used to produce values in the range
{0, 1, ..., m - 1}:

$$h_1(x) := \text{MurmurHash3}(x) \bmod m,$$
$$h_2(x) := \text{FNV1a}(x) \bmod m.$$


The empty filter has the following form:

    index: 0 1 2 3 4 5 6 7 8 9
    bit:   0 0 0 0 0 0 0 0 0 0

As an example, we insert the names of capital cities into the filter. Let’s
start with Copenhagen; in order to find the corresponding bits in
the filter, we compute its hash values:

$$h_1(\text{Copenhagen}) = \text{MurmurHash3}(\text{Copenhagen}) \bmod 10 = 7,$$
$$h_2(\text{Copenhagen}) = \text{FNV1a}(\text{Copenhagen}) \bmod 10 = 3.$$

Hence, we need to set bits 3 and 7 in the filter:

    index: 0 1 2 3 4 5 6 7 8 9
    bit:   0 0 0 1 0 0 0 1 0 0

Different elements can share corresponding bits. For instance, let’s add
another element, Dublin, to the filter:

$$h_1(\text{Dublin}) = \text{MurmurHash3}(\text{Dublin}) \bmod 10 = 1,$$
$$h_2(\text{Dublin}) = \text{FNV1a}(\text{Dublin}) \bmod 10 = 3.$$

As you can see, its corresponding bit-positions in the filter are 1 and
3, where only bit 1 hasn’t been set yet (meaning that some element in
the filter that is not Dublin has 3 as one of its corresponding bits):

    index: 0 1 2 3 4 5 6 7 8 9
    bit:   0 1 0 1 0 0 0 1 0 0


When we need to test if the given element x is in the filter, we
compute all k hash values $h_i(x)$, $i = 1, \dots, k$, and check the bits
in the corresponding positions. If all bits are set to one, then the element
x may exist in the filter. Otherwise, the element x is definitely not
in the filter. The uncertainty about the element’s existence originates
from the possibility that some bits were set by different
previously added elements (as we saw in Example 2.3) or, due to hard
collisions, that all hash functions collide accidentally.

Algorithm 2.2: Testing element in the Bloom filter
Input: Element x ∈ D
Input: Bloom filter with k hash functions {h_i}, i = 1, ..., k
Output: False if element not found and True if element may exist
for i ← 1 to k do
    j ← h_i(x)
    if BloomFilter[j] ≠ 1 then
        return False
return True
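
The add and test operations of Algorithms 2.1 and 2.2 can be sketched in
Python as follows. This is a minimal sketch under stated assumptions, not
a reference implementation: the k hash functions are derived from a single
32-bit FNV-1a hash by prefixing the input with the function index (a
shortcut chosen for self-containment, so the functions are not truly
independent), and one byte is used per bit for readability. The bit positions
it produces will therefore differ from those in the examples.

```python
def fnv1a_32(data: bytes) -> int:
    """32-bit FNV-1a hash."""
    h = 0x811C9DC5  # FNV offset basis
    for byte in data:
        h ^= byte
        h = (h * 0x01000193) & 0xFFFFFFFF  # multiply by the FNV prime, keep 32 bits
    return h


class BloomFilter:
    def __init__(self, m: int, k: int) -> None:
        self.m = m                # length of the filter
        self.k = k                # number of hash functions (at most 256 here)
        self.bits = bytearray(m)  # one byte per bit, all zeros: the filter is empty

    def _positions(self, element: str):
        data = element.encode("utf-8")
        for i in range(self.k):
            # Assumption: h_i(x) is FNV-1a of the index byte followed by x.
            yield fnv1a_32(bytes([i]) + data) % self.m

    def add(self, element: str) -> None:
        for j in self._positions(element):
            self.bits[j] = 1

    def test(self, element: str) -> bool:
        # False: definitely not in the filter; True: may be in the filter.
        return all(self.bits[j] == 1 for j in self._positions(element))
```

With such a class, every element that has been added always tests as
possibly present, while an empty filter rejects any element.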


Example 2.4: Test elements in the filter 

Consider the Bloom filter from Example 2.3 with two indexed elements,
Copenhagen and Dublin:

    index: 0 1 2 3 4 5 6 7 8 9
    bit:   0 1 0 1 0 0 0 1 0 0

To test if the element Copenhagen is in the filter, we again need to
compute its hash values $h_1(\text{Copenhagen}) = 7$ and
$h_2(\text{Copenhagen}) = 3$. After that, we check the corresponding bits
in the filter and see that both of them are set to one, therefore we claim
that Copenhagen may exist in the filter.

Now we will consider an element Rome and compute its hashes in order
to find the corresponding bits in the filter:

$$h_1(\text{Rome}) = \text{MurmurHash3}(\text{Rome}) \bmod 10 = 5,$$
$$h_2(\text{Rome}) = \text{FNV1a}(\text{Rome}) \bmod 10 = 6.$$

Thus, checking the bits 5 and 6, we see that bit 5 isn’t set, therefore
the element Rome is definitely not in the filter and we don’t even need
to check bit 6.

However, the filter can also result in a false positive answer. Consider
the element Berlin, whose hash values are

$$h_1(\text{Berlin}) = \text{MurmurHash3}(\text{Berlin}) \bmod 10 = 1,$$
$$h_2(\text{Berlin}) = \text{FNV1a}(\text{Berlin}) \bmod 10 = 7.$$

The corresponding bits 1 and 7 are both set in the filter, hence the result of
the test function is that the element may exist in the filter. At the same
time, we know that we have not added that element, and this is an example
of a hash collision. Note that in this particular case bit 1 was set by
$h_1(\text{Dublin})$ and bit 7 by $h_1(\text{Copenhagen})$.


If each hash function $h_i$ can be computed in constant time (which
is true for all the most popular hash functions), the time to add a new
element or test an element is O(k), independent of
the filter’s length m and the number of elements in the filter.

The performance of the Bloom filter is highly dependent on the chosen hash 
functions. A hash function with a good uniformity will reduce the practically 
observed false positive rate. On the other hand, the faster the computation 
of each hash function, the smaller the overall time of each operation, and 
it is therefore recommended to avoid cryptographic hash functions. 


Example 2.5: Prevent compromised passwords (Spafford, 1991) 

Consider a web service registration page where we want to prevent users
from choosing weak and compromised passwords. Note that on the Dark
Web it is possible to find hundreds of millions¹ of hacked passwords that
can be used in a dictionary attack, a brute force attack that makes
repetitive attempts to defeat an authentication by trying all values from
a pre-arranged listing. Thus, every time the user types a new password, we
would like to ensure it is not in such a list. However, along with the lack
of security related to storing raw passwords, we don’t want to maintain
a huge dataset that grows linearly with every newly added password, which
would slow down our lookups (as in traditional databases).

Therefore, the usage of a space-efficient Bloom filter is essential. The false 
positive event, in this case, is a situation where we mistakenly think that 
the entered password is unsuitable. In such rare cases, we need to ask 
the user to type another password and that usually doesn’t hurt. 


1 1.4 Billion Clear Text Credentials Discovered https://medium.com/4iqdelvedeep/3131d0a1ae14


Count unique elements in the filter

A method to estimate the number of unique elements indexed into 
the filter was proposed by S. Joshua Swamidass and Pierre Baldi and, in 
fact, is an extension of the Linear Counting algorithm that is discussed 
in the next chapter. Using the information about the number of set bits 
in the filter and the probability of each bit to be set, it provides a simple 
formula to approximate the number of elements in the filter. Since two 
identical elements added into the filter won’t change the number of set 
bits, such approximation gives an estimation for the number of unique 
elements (known as cardinality). 


Algorithm 2.3: Counting unique elements in the Bloom filter
Input: Bloom filter of length m with k hash functions
Output: Number of unique elements in the filter
N ← count(BloomFilter[i] = 1)
if N < k then
    return 0
if N = k then
    return 1
if N = m then
    return m / k
return -(m / k) · ln(1 - N / m)
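
The estimate can be sketched in Python as follows. This is a minimal
illustration assuming the filter is given as a plain sequence of 0/1 values;
the function name estimate_cardinality is hypothetical.

```python
import math
from typing import Sequence


def estimate_cardinality(bits: Sequence[int], k: int) -> float:
    """Estimate the number of unique elements in a Bloom filter with k hash functions."""
    m = len(bits)
    n_set = sum(1 for b in bits if b)  # N, the number of set bits
    if n_set < k:
        return 0.0
    if n_set == k:
        return 1.0
    if n_set == m:
        # The filter is full; the logarithm below would be undefined.
        return m / k
    return -(m / k) * math.log(1 - n_set / m)


# The filter from Example 2.3 after adding Copenhagen and Dublin (bits 1, 3, 7).
bits = [0, 1, 0, 1, 0, 0, 0, 1, 0, 0]
print(estimate_cardinality(bits, k=2))
```

For such a tiny filter the estimate is rough (about 1.8 for two stored
elements); it becomes more useful for long filters that are not close to full.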


Properties 

False positives are possible. As has already been mentioned,
the Bloom filter doesn’t store elements and relies heavily on
the calculated hashes, which are stored all together in a bit array. Such
a space-efficient representation can lead to situations where some element
is not a member (wasn’t added to the filter), but the algorithm reports
that it is. Such an event is called a false positive and can occur because
of hash collisions or because bits set by different elements are mixed
together: in the test operation there is no knowledge of whether
a particular bit was set by the same hash function as the one we compare
with.


The Bloom filter principle [Br04]: Wherever a list or set is used, and 
space is at a premium, consider using a Bloom filter if the effect of false 
positives can be mitigated. 

Fortunately, such false positive situations rarely happen and their
probability $P_{fp}$ can be easily estimated (actually, this is a lower
bound):

$$P_{fp} \approx \left(1 - e^{-\frac{kn}{m}}\right)^{k}. \tag{2.1}$$


As we can see from (2.1), under the fixed number of expected elements n,
the probability of false positives depends on the choice of k and m. It
is a clear trade-off between the length of the filter, the number of hash
functions, and the probability of such events.

In the extreme case, when the filter is full (meaning all bits are set), 
every lookup will yield a (false) positive response. It means that the choice 
of m depends on the (estimated) number of elements n that are expected 
to be added, and m should be quite large compared to n. 
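
The dependence in (2.1) is easy to explore numerically. The sketch below
only evaluates the formula; the function name and the sample values of n, m,
and k are illustrative assumptions.

```python
import math


def false_positive_probability(n: int, m: int, k: int) -> float:
    """Approximate P_fp from (2.1) for n elements, m bits and k hash functions."""
    return (1 - math.exp(-k * n / m)) ** k


# A fixed filter (m = 1000 bits, k = 4): the estimate grows as elements are added.
for n in (50, 100, 200, 400, 5000):
    print(n, round(false_positive_probability(n, 1000, 4), 4))
```

As n grows far beyond what the filter was sized for, the estimate approaches
one, which corresponds to the full filter described above.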


In practice, the length of the filter m, for a given false positive
probability $P_{fp}$ and expected number of elements n, can be determined
by the formula:

$$m = -\frac{n \ln P_{fp}}{(\ln 2)^2}. \tag{2.2}$$


Thus, a filter must grow linearly with the number of elements to keep 
the fixed false positive probability. 

For the given ratio m/n, meaning the number of allocated bits per
element, the false positive probability can be tuned by choosing
the number of hash functions k. The optimal choice of k is computed by
minimizing the probability of false positives in (2.1):

$$k = \frac{m}{n} \ln 2. \tag{2.3}$$

In other words, the optimal number of hash functions k is about 0.7 times
the number of bits per element. Since k must be an integer, the smaller
sub-optimal values are preferred.

Some widely used near-optimal solutions can be found below.

Table 2.1: Near-optimal choices of parameters

    k     m/n    P_fp
    4     6      0.0561
    6     8      0.0215
    8     12     0.00314
    11    16     0.000458


Example 2.6: Parameters estimation

According to (2.2), to support the false positive probability
$P_{fp}$ = 1%, the filter has to be about 10 times longer than the expected
number of elements n and use 6 hash functions. At the same time, the length
of the filter doesn’t depend on the size of the elements themselves and
stays the same for elements of different natures.
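
The sizing rule can be sketched as a small helper. This is an illustration
only: the function name is hypothetical, the formula for m is (2.2), and k is
taken as about 0.7 times the number of bits per element, rounded down to
the smaller integer.

```python
import math


def optimal_parameters(n: int, p_fp: float) -> tuple[int, int]:
    """Return (m, k): filter length in bits and number of hash functions."""
    m = math.ceil(-n * math.log(p_fp) / math.log(2) ** 2)
    # k is roughly 0.7 * (m / n); the smaller integer is preferred.
    k = max(1, math.floor(m / n * math.log(2)))
    return m, k


m, k = optimal_parameters(n=1000, p_fp=0.01)
print(m / 1000, k)  # bits per element and number of hash functions
```

For a 1% false positive probability this gives roughly 9.6 bits per element
and k = 6, in line with the example above.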


Bloom filters can be seen as a generalization of hash tables. In fact,
a filter with only one hash function is equivalent to a hash table. However,
using many hash functions, Bloom filters can maintain the constant false
positive probability even for a fixed number of bits per element, while
hash tables cannot.

False negatives are not possible. In contrast to the situation above,
if the Bloom filter returns that a particular element isn’t a member, then
it’s definitely not a member of the set:

$$P_{fn} = 0. \tag{2.4}$$

Works well while it fits in memory. As already mentioned above,
the probability of false positives can be decreased by allocating more
memory; this is why people tend to create bigger filters (with larger m).
However, such classical Bloom filters work well only while they fit in
the main memory. As soon as they grow too big and have to be moved to disk,
they immediately hit a problem induced by the design: uniformly
distributed hash functions produce random indices that need to be
randomly accessed every time, which is very unpleasant for disks
with rotating platters and moving heads (for solid-state storage devices
it is much better, but still not perfect).


Example 2.7: Required memory

According to (2.2), to handle 1 billion elements and keep the probability
of false positive events at about 2% using the optimal number of hash
functions, we need to choose a filter that is
$m = -10^9 \cdot \ln(0.02) / (\ln 2)^2 \approx 8.14 \cdot 10^9$ bits long,
which is roughly 1 GB of memory.


Two different Bloom filters of the same length can be merged only if
they also have the same hash functions. In this case, the merge is just
a bitwise OR operation and the result is fully equivalent to the Bloom
filter built for the union of those two sets of elements. The intersection
of those two Bloom filters is also possible and can be done by bitwise
AND; however, the result can have a higher probability of false positives
than a filter built directly for the intersection of the two sets.

Unfortunately, it is not possible to adjust the size of a Bloom filter when
it runs out of space without recomputing all the hashes that are already
in the filter, which is unlikely to be feasible in Big Data applications.
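
Merging two filters can be sketched on the bit arrays from Example 2.3. This
is a minimal illustration under stated assumptions: filters are represented
as byte strings with one byte per bit, the helper names are hypothetical,
the intersection is taken as the bitwise AND of the two arrays, and the bit
positions of Copenhagen and Dublin are the ones given in that example.

```python
def from_positions(m: int, positions: list[int]) -> bytes:
    """Build a filter of length m with the given bit positions set."""
    bits = bytearray(m)
    for j in positions:
        bits[j] = 1
    return bytes(bits)


def union(bits_a: bytes, bits_b: bytes) -> bytes:
    """Bitwise OR: the filter for the union of the two underlying sets."""
    if len(bits_a) != len(bits_b):
        raise ValueError("filters must have the same length and hash functions")
    return bytes(a | b for a, b in zip(bits_a, bits_b))


def intersection(bits_a: bytes, bits_b: bytes) -> bytes:
    """Bitwise AND: an approximation of the filter for the common elements."""
    if len(bits_a) != len(bits_b):
        raise ValueError("filters must have the same length and hash functions")
    return bytes(a & b for a, b in zip(bits_a, bits_b))


copenhagen = from_positions(10, [7, 3])  # filter that contains only Copenhagen
dublin = from_positions(10, [1, 3])      # filter that contains only Dublin

print(list(union(copenhagen, dublin)))         # [0, 1, 0, 1, 0, 0, 0, 1, 0, 0]
print(list(intersection(copenhagen, dublin)))  # [0, 0, 0, 1, 0, 0, 0, 0, 0, 0]
```

The union equals the filter obtained by adding both elements to one filter.
The intersection keeps bit 3 set although the two sets share no element,
which illustrates why the intersected filter can be less precise.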


Example 2.8: Cache sharing (Fan, 2000)

Consider a list of distributed caching proxies $P_1, P_2, \dots, P_n$ in
a network that share their caches. If the content of the requested URL is
stored on a proxy $P_i$, then the proxy returns it without an actual call
to the remote server; otherwise, the content will be retrieved, stored
locally, and sent to the client.

With the goal of minimizing the network traffic and distributing the storage,
we can set up routing in that proxy network and attempt to forward
the request to a proxy that already has the content stored, if any, and
otherwise call the remote server. Since requests from the client can come
to any proxy, there is the problem of sharing the routing list on each proxy
and, when it changes, exchanging it within the network or merging it, if
necessary.

The Bloom filter is a natural choice to store such routing lists and perform
fast membership queries. It is also easily transferable in the network due
to its small size.

The false positive event, in this case, is that some proxy $P_i$ assumes that
another proxy $P_j$ may have the content for the requested URL, but in fact,
it doesn’t. $P_i$ routes traffic to $P_j$ and asks it to return the content,
therefore $P_j$ has to call the remote server. As a result, it produces some
additional network traffic and stores redundant local copies of such content,
which is completely acceptable.


Deletion is not possible. To delete a particular element from
the Bloom filter, we would need to unset its corresponding k bits in
the bit array. Unfortunately, a single bit could correspond to multiple
elements due to hash collisions and shared bits between elements.

A number of extensions have been developed that support deletion of
elements, but they always pay for it in space and speed. The lack of
deletion is part of why the classical Bloom filter is so fast and
space-efficient.

Fortunately, missing deletion support is not a problem for many
real-world applications, but if you really need it you have to go for
modifications of the Bloom filter, for example the Counting Bloom filter.


2.2 Counting Bloom filter 

The most popular extension of the classical Bloom filter that supports
deletion is the Counting Bloom filter, proposed by Li Fan, Pei Cao,
Jussara Almeida, and Andrei Z. Broder in 2000 [Fa00]. Building on
the classical Bloom filter algorithm, it introduces an array of m
counters $\{C_j\}_{j=1}^{m}$ corresponding to each bit in the filter’s array.

The Counting Bloom filter allows approximating the number of times
each element has been seen in the filter by incrementing
the corresponding counter every time the element is added.
The associated CountingBloomFilter data structure contains a bit
array and the array of counters of length m, all initialized to zeros.

When we insert a new element into CountingBloomFilter, we 
first compute its corresponding bit-positions, then for each position we 
increment the associated counter and, only if it changes from zero to one, 
set the bit in the filter, similar to the step of the classical Algorithm 2.1. 


Algorithm 2.4: Adding element to the Counting Bloom filter
Input: Element x ∈ D
Input: Counting Bloom filter with m counters {C_j}, j = 1, ..., m,
       and k hash functions {h_i}, i = 1, ..., k
for i ← 1 to k do
    j ← h_i(x)
    C_j ← C_j + 1
    if C_j = 1 then
        CountingBloomFilter[j] ← 1


The test operation looks exactly the same as for the classical Bloom 
filter Algorithm 2.2 since we don’t need to check counters at all. 
The amount of time required to test an element is comparable to 
the classical algorithm because the filters’ bit arrays are the same. 


Algorithm 2.5: Testing element in the Counting Bloom filter
Input: Element x ∈ D
Input: Counting Bloom filter with k hash functions {h_i}, i = 1, ..., k
Output: False if element not found and True if element may exist
for i ← 1 to k do
    j ← h_i(x)
    if CountingBloomFilter[j] ≠ 1 then
        return False
return True


The deletion is quite similar to the insertion but in reverse. To delete
an element x, we compute all k hash values $h_i(x)$, $i = 1, \dots, k$,
and decrease the corresponding counters. If a counter changes its value
from one to zero, the corresponding bit in the bit array has to be unset.


Algorithm 2.6: Deleting element from the Counting Bloom filter
Input: Element x ∈ D
Input: Counting Bloom filter with m counters {C_j}, j = 1, ..., m,
       and k hash functions {h_i}, i = 1, ..., k
for i ← 1 to k do
    j ← h_i(x)
    C_j ← C_j - 1
    if C_j = 0 then
        CountingBloomFilter[j] ← 0


Algorithm 2.6 presupposes that element x exists (or may exist) in
the filter, therefore it might be necessary to test elements before
decreasing the corresponding counters.
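
The three operations of the Counting Bloom filter can be sketched together
in Python. This is a minimal sketch under stated assumptions, not a reference
implementation: the k hash functions are derived from a single 32-bit FNV-1a
hash by prefixing the input with the function index, counters are unbounded
Python integers, and the remove method tests the element first, as suggested
above.

```python
def fnv1a_32(data: bytes) -> int:
    """32-bit FNV-1a hash."""
    h = 0x811C9DC5
    for byte in data:
        h ^= byte
        h = (h * 0x01000193) & 0xFFFFFFFF
    return h


class CountingBloomFilter:
    def __init__(self, m: int, k: int) -> None:
        self.m = m
        self.k = k
        self.bits = bytearray(m)  # one byte per bit, for readability only
        self.counters = [0] * m   # one counter per bit position

    def _positions(self, element: str) -> list[int]:
        data = element.encode("utf-8")
        # Assumption: h_i(x) is FNV-1a of the index byte followed by x.
        return [fnv1a_32(bytes([i]) + data) % self.m for i in range(self.k)]

    def add(self, element: str) -> None:
        for j in self._positions(element):
            self.counters[j] += 1
            if self.counters[j] == 1:
                self.bits[j] = 1

    def test(self, element: str) -> bool:
        return all(self.bits[j] == 1 for j in self._positions(element))

    def remove(self, element: str) -> bool:
        # Only elements that may exist in the filter are deleted.
        if not self.test(element):
            return False
        for j in self._positions(element):
            if self.counters[j] > 0:
                self.counters[j] -= 1
                if self.counters[j] == 0:
                    self.bits[j] = 0
        return True
```

Deleting one element leaves the bits shared with other elements set, because
their counters stay above zero.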


Properties 

The Counting Bloom filter inherits all the properties of the classical Bloom
filter, including false positive error estimation and recommendations for
the optimal choice of m and k given by the dependencies (2.2) and (2.3).

Naturally, Counting Bloom filters are much bigger than classical Bloom
filters because additional memory has to be allocated for the counters
even if most of them are zeros. Therefore, it is important to estimate how
large such counters can become and how their size depends on the filter’s
length m and the number of hash functions k.

Assuming that every counter C has a capacity level N, the probability
that the value goes over that capacity level (known as overflow probability)
in a Counting Bloom filter of length m with the optimal choice of k from
the relation (2.3) is

$$\Pr\left(\max(C) \ge N\right) \le m \cdot \left(\frac{e \ln 2}{N}\right)^{N}. \tag{2.5}$$


For instance, for 4-bit counters (N = 16) the overflow probability given by
formula (2.5) becomes

$$\Pr\left(\max(C) \ge 16\right) \le m \cdot 1.37 \cdot 10^{-15}.$$
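
The bound is simple to evaluate numerically. The sketch below assumes
the bound has the form m · (e ln 2 / N)^N, which reproduces the constant
1.37 · 10^-15 quoted above for N = 16; the function name and the sample
filter length are illustrative.

```python
import math


def overflow_bound(m: int, capacity: int) -> float:
    """Upper bound on the probability that some counter reaches the capacity."""
    return m * (math.e * math.log(2) / capacity) ** capacity


print(overflow_bound(1, 16))             # per-position factor for 4-bit counters
print(overflow_bound(4 * 10**9, 16))     # a filter with a few billion positions
```

Even for a few billion positions the bound stays in the order of 10^-6,
while for 2-bit counters (N = 4) the same expression is far too large to be
useful on its own.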

In other words, if we allocate 4 bits per counter, the probability of overflow
for practical values of m (e.g., a few billion bit-positions) during the initial
insertion into the filter is extremely small. After many deletions and
insertions, the probability can become a bit bigger, but is still small enough
for practical usage.

To prevent arithmetic overflow (i.e., incrementing a counter that already
has the maximum possible value), each counter in the array must be
sufficiently large in order to retain the properties of Bloom filters. In
practice, the counter consists of 4 or more bits and a Counting Bloom
filter, therefore, requires at least four times more space than a classical
Bloom filter.


It depends on the application, but if a 4-bit counter ever exceeds the value
of 15 we can simply “freeze” it and let it stay at 15. After many deletions,
this might lead to a situation where the Counting Bloom filter produces
a false negative response (the counter becomes zero when it shouldn’t be),
but the probability of such a chain of events is so low that it is much more
likely that our application will be rebooted and the filter re-created.
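
The effect of freezing a counter can be seen on a single counter. This is
an illustrative sketch under the assumption that a frozen counter is still
decremented on deletion; the constant and function names are hypothetical.

```python
MAX_COUNT = 15  # the largest value of a 4-bit counter


def increment(counter: int) -> int:
    """Saturating increment: the counter is frozen at MAX_COUNT."""
    return min(counter + 1, MAX_COUNT)


def decrement(counter: int) -> int:
    """Decrement that never goes below zero."""
    return max(counter - 1, 0)


counter = 0
for _ in range(17):  # 17 insertions hit this position; the counter stops at 15
    counter = increment(counter)
for _ in range(15):  # 15 deletions
    counter = decrement(counter)

print(counter)  # 0, although two insertions have not been deleted
```

The counter reaches zero while two elements still map to the position, so
the corresponding bit would be unset and those elements would be reported as
absent.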

However, it is possible to design a more complex version of the Counting
Bloom filter with smaller counters (e.g., 2-bit) that uses less space and
is still practically useful by adopting an approach similar to closed
addressing in hash tables and introducing a secondary hash table to
manage overflowing counters.

Therefore, the Counting Bloom filter supports only probabilistically
correct deletions because there is a non-zero probability of error as soon
as some counter goes above its maximal size.

Despite the noted peculiarities, Counting Bloom filters are extensively
used by Apache Hadoop and Apache Spark in MapReduce applications
to speed up the processing of huge datasets on big clusters by helping to
reduce their volume.


2.3 Quotient filter 

When the classical Bloom filter and its modifications do not fit into
the main memory, they are entirely unfriendly to storage because any
operation requires multiple random accesses. One of
the data structures that supports the basic operations of Bloom filters,
but with better data locality and requiring only a small number of
contiguous disk accesses, is the Quotient filter, presented by Michael
Bender et al. in 2011 [Be11].

The filter achieves comparable performance regarding space and time,
but additionally supports deletions and can be dynamically resized or
merged. The name of this data structure originates from the arithmetic
quotient, which is the result of the division operation.

The Quotient filter represents a dataset $D = \{x_1, x_2, \dots, x_n\}$ by
storing a p-bit fingerprint for each of its elements and requires only one
hash function to generate such fingerprints. In order to support enough
randomness, the hash function should generate uniformly and
independently distributed fingerprints.

Each fingerprint $f$ in the algorithm is partitioned into q most significant
bits (the quotient) and r least significant bits (the remainder) using
the quotienting technique, suggested by Donald Knuth².


2 D. Knuth, The Art of Computer Programming, Vol. 3: Sorting and Searching, 1973

Algorithm 2.7: Quotienting technique
Input: Fingerprint f
Output: Quotient f_q and remainder f_r
f_r ← f mod 2^r
f_q ← ⌊f / 2^r⌋
return f_q, f_r


Practically, to improve spatial locality, the QuotientFilter data
structure is represented by a compact open addressing hash table with
$m = 2^q$ buckets where the remainder $f_r$ is stored in a bucket indexed by
the quotient $f_q$. Possible collisions are resolved by linear probing.

Given a remainder $f_r$ in bucket $f_q$, the full fingerprint can be uniquely
reconstructed as

$$f = f_q \cdot 2^r + f_r.$$
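
Quotienting and the reconstruction of the fingerprint can be sketched with
bit operations. The function names and the small sample fingerprint are
illustrative assumptions.

```python
def quotienting(f: int, r: int) -> tuple[int, int]:
    """Split fingerprint f into (quotient, remainder) with r remainder bits."""
    f_q = f >> r              # most significant bits: floor(f / 2^r)
    f_r = f & ((1 << r) - 1)  # r least significant bits: f mod 2^r
    return f_q, f_r


def reconstruct(f_q: int, f_r: int, r: int) -> int:
    """Rebuild the fingerprint: f = f_q * 2^r + f_r."""
    return (f_q << r) | f_r


f = 0b10110                      # a 5-bit fingerprint, q = 2 and r = 3
f_q, f_r = quotienting(f, r=3)
print(f_q, f_r)                  # 2 6
print(reconstruct(f_q, f_r, 3))  # 22
```

The shift and the mask are equivalent to the integer division and the modulo
operation with 2^r, so no information about the fingerprint is lost.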

Each bucket contains three metadata bits, all unset at the beginning:
is_occupied, is_continuation, and is_shifted. These bits play
an important role in navigating the data structure.

• is_occupied is set when the bucket j is the canonical bucket
($f_q = j$) for some fingerprint $f$ stored somewhere in the filter.

• is_continuation is set when the bucket is occupied, but not by
the first of the remainders that belong to the same bucket.

• is_shifted is set when the remainder in the bucket is not in its
canonical bucket.

Figure 2.1: Bucket in the Quotient filter. Bucket $f_q$ holds the metadata
bits is_occupied, is_continuation, and is_shifted, followed by
the remainder $f_r$.
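
A bucket with its metadata bits can be modeled as a small record. This is
only a sketch of one possible representation: the Bucket class, the use of
None for an empty slot, and the describe helper are assumptions made for
illustration, following the meaning of the bits given above.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Bucket:
    is_occupied: bool = False      # some stored fingerprint has this bucket as canonical
    is_continuation: bool = False  # the remainder is not the first one of its run
    is_shifted: bool = False       # the remainder is not in its canonical bucket
    remainder: Optional[int] = None


def describe(bucket: Bucket) -> str:
    """Explain what the remainder stored in the bucket represents."""
    if bucket.remainder is None:
        return "empty"
    if bucket.is_continuation:
        return "continuation of a run"
    if bucket.is_shifted:
        return "start of a shifted run"
    return "start of a run in its canonical bucket"


print(describe(Bucket()))
print(describe(Bucket(is_occupied=True, remainder=5)))
print(describe(Bucket(is_continuation=True, is_shifted=True, remainder=7)))
```

Note that is_occupied describes the bucket index itself, while the other two
bits describe the remainder that currently sits in the bucket.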


When two different fingerprints $f$ and $f'$ have the same quotient
(meaning $f_q = f'_q$), it is a soft collision that can be resolved by
the linear probing strategy that we discussed earlier. In the Quotient
filter it is implemented by storing all the remainders of fingerprints with
the same quotient contiguously in a run. If necessary, a remainder can be
shifted forward from its original location and stored in a subsequent bucket,
wrapping around at the end of the array.

Algorithm 2.8: Using right shift to empty buckets
Input: Bucket index k
Input: Quotient filter of length m
prev ← QuotientFilter[k]
i ← k + 1
while True do
    if QF[i] = NULL then
        QF[i] ← prev
        QF[i].is_continuation ← 1
        QF[i].is_shifted ← 1
        return QF
    else
        curr ← QF[i]
        QF[i] ← prev
        QF[i].is_continuation ← prev.is_continuation
        QF[i].is_shifted ← prev.is_shifted
        prev ← curr
        prev.is_continuation ← curr.is_continuation
        prev.is_shifted ← curr.is_shifted
    i ← i + 1
    if i ≥ m then
        i ← 0


The sequence of one or more consecutive runs with no empty buckets
in between is called a cluster. Every cluster is immediately preceded by
an empty bucket, and the is_shifted bit of its first value is never set.

The internal hash table is compactly stored in an array to reduce
the required memory and achieve better data locality; however, this
makes the navigation through it quite complex.

Consider a scan function that is designed to find the run. It starts by
walking backward from the canonical bucket for $f$ to find the beginning
of the cluster. As soon as the cluster’s start is found, it goes forward
again to find the location of the first remainder for the bucket $f_q$,
that is the actual start of the run $r_{start}$.


Algorithm 2.9: Scanning the Quotient filter to find the run
Input: Canonical bucket index f_q, Quotient filter
j ← f_q
while QF[j].is_shifted = 1 do
    j ← j - 1
r_start ← j
while j ≠ f_q do
    /* skip all elements in the current run and find the next occupied bucket */
    repeat
        r_start ← r_start + 1
    until QF[r_start].is_continuation ≠ 1
    repeat
        j ← j + 1
    until QF[j].is_occupied = 1
r_end ← r_start
repeat
    r_end ← r_end + 1
until QF[r_end].is_continuation ≠ 1
return r_start, r_end


When we want to insert a new element into QuotientFilter, we
first calculate its quotient and remainder. If the canonical bucket for
the element isn’t occupied, it can immediately be inserted using
the insertion procedure given by Algorithm 2.10. Otherwise, before
insertion, it is necessary to find an appropriate bucket with the scan
function from Algorithm 2.9. Once the correct bucket is found, the actual
insertion still requires merging $f_r$ into the sequence
of already stored elements, which may require shifting subsequent
elements to the right and updating the corresponding metadata bits.

With the mentioned selection strategy for the appropriate bucket
and the right shift function given by Algorithm 2.8, we can formulate
the complete insertion procedure below.


Algorithm 2.10: Adding element to the Quotient filter
Input: Element x ∈ D
Input: Quotient filter with hash function h
f ← h(x)
f_q, f_r ← f
if QF[f_q].is_occupied ≠ 1 and QF[f_q] is empty then
    QF[f_q] ← f_r
    QF[f_q].is_occupied ← 1
    return True
QF[f_q].is_occupied ← 1
r_start, r_end ← Scan(QF, f_q)
for i ← r_start to r_end do
    if QF[i] = f_r then
        /* f_r already exists */
        return True
    else if QF[i] > f_r then
        /* insert f_r in the bucket i and shift others */
        QF ← ShiftRight(QF, i)
        QF[i] ← f_r
        return True
/* the run should be extended with the new element */
QF ← ShiftRight(QF, r_end + 1)
QF[r_end + 1] ← f_r
return True


According to the linear probing scheme, the length of most runs is O(1)
and, as the authors of the filter noted, it is highly likely that all runs
have length O(log m).


Example 2.9: Add elements to the filter 

Consider a Quotient filter with 32-bit fingerprints produced by the 32-bit
version of the MurmurHash3 hash function:

$$h(x) := \text{MurmurHash3}(x).$$

For the bucketing we reserve q = 3 most significant bits, hence the size of
the QuotientFilter is $m = 2^3 = 8$, and the remaining r = 29 bits we store
in the chosen buckets.

As in Example 2.3, we start indexing names of capitals and the first
element to add into the filter is Copenhagen. We need to compute its
fingerprint using the hash function h:

$$f = h(\text{Copenhagen}) = 4248224207.$$

According to Algorithm 2.7, the quotient and remainder are

$$f_q = \left\lfloor \frac{f}{2^{29}} \right\rfloor = 7,$$
$$f_r = f \bmod 2^{29} = 490127823.$$

The canonical bucket for the element Copenhagen is $j = f_q = 7$, where
we want to index its remainder $f_r$. Insertion at this point is
straightforward since no buckets are occupied yet: we insert
$f_r = 490127823$ in the bucket with index 7 and set the is_occupied bit:

bucket  is_occupied  is_continuation  is_shifted  remainder
0       0            0                0           -
1       0            0                0           -
2       0            0                0           -
3       0            0                0           -
4       0            0                0           -
5       0            0                0           -
6       0            0                0           -
7       1            0                0           490127823
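
The quotient and remainder arithmetic used throughout this example can be checked with a short Python sketch. It assumes 32-bit fingerprints with q = 3 quotient bits, so the remainder takes the other 29 bits; the function name `split_fingerprint` is illustrative and not part of the original text.

```python
def split_fingerprint(f: int, q: int, p: int = 32) -> tuple[int, int]:
    """Split a p-bit fingerprint into (quotient, remainder)."""
    r = p - q                     # number of remainder bits
    f_q = f >> r                  # the q most significant bits
    f_r = f & ((1 << r) - 1)      # the r least significant bits
    return f_q, f_r


# Fingerprints quoted in the example for three of the capitals.
for name, f in [("Copenhagen", 4248224207),
                ("Lisbon", 629555247),
                ("Paris", 2673248856)]:
    print(name, split_fingerprint(f, q=3))
```

Shifting and masking are equivalent to the integer division and modulo by 2^29 used in the text.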


In the same way, we index the element Lisbon, which has the fingerprint
f = 629555247 and the canonical bucket 1, and Paris with the fingerprint
f = 2673248856 and the canonical bucket 4. Since those canonical buckets
are free, we insert the remainders accordingly and set the is_occupied bits:

bucket  is_occupied  is_continuation  is_shifted  remainder
0       0            0                0           -
1       1            0                0           92684335
2       0            0                0           -
3       0            0                0           -
4       1            0                0           525765208
5       0            0                0           -
6       0            0                0           -
7       1            0                0           490127823


Next, we add the element Stockholm with fingerprint f = 775943400, getting
its canonical bucket j = f_q = 1 and the remainder f_r = 239072488.
However, the canonical bucket 1 already has its is_occupied bit set,
meaning that it is already occupied by the remainder of another element
(element Lisbon in this case).

Since the is_shifted and is_continuation bits are not set, we are at
the beginning of the cluster, which is also the start of the run. The
remainder f_r is bigger than the already indexed value 92684335, therefore
it should be stored in the next available bucket, which is bucket 2, and
the is_shifted and is_continuation bits of that bucket have to be set.
However, the is_occupied bit for bucket 2 remains unchanged, because there
is no stored remainder that has it as a canonical bucket.


bucket  is_occupied  is_continuation  is_shifted  remainder
0       0            0                0           -
1       1            0                0           92684335
2       0            1                1           239072488
3       0            0                0           -
4       1            0                0           525765208
5       0            0                0           -
6       0            0                0           -
7       1            0                0           490127823

Runs: buckets 1-2, bucket 4, bucket 7. Clusters: buckets 1-2, bucket 4,
bucket 7.


The next element is Zagreb, which has fingerprint f = 1474643542, canonical
bucket j = 2 and remainder f_r = 400901718. Unfortunately, bucket 2 is
already in use by a shifted value, as shown by its set is_shifted bit,
although its is_occupied bit is not set. Thus, the value f_r has to be
shifted right as well, into the next available bucket, which is bucket 3 in
this case.
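
bucket  is_occupied  is_continuation  is_shifted  remainder
0       0            0                0           -
1       1            0                0           92684335
2       1            1                1           239072488
3       0            0                1           400901718
4       1            0                0           525765208
5       0            0                0           -
6       0            0                0           -
7       1            0                0           490127823

Runs: buckets 1-2, bucket 3, bucket 4, bucket 7. Clusters: buckets 1-3,
bucket 4, bucket 7.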


We set the is_shifted bit of bucket 3 to indicate that the bucket contains
a value shifted from its canonical position, but keep the is_continuation
bit unset since it is the first element associated with that canonical
bucket. Additionally, we set the is_occupied bit for bucket 2 to remember
that there is at least one stored remainder that has it as its canonical
bucket.

Finally, let’s add the element Warsaw with fingerprint f = 567538184, whose
quotient and remainder are

$$f_q = \left\lfloor \frac{f}{2^{29}} \right\rfloor = 1, \qquad f_r = f \bmod 2^{29} = 30667272.$$

The canonical bucket j = f_q = 1 is already occupied according to the set
is_occupied bit. However, the other bits are not set, meaning that we are
at the beginning of the cluster, which is also the start of the run. The
remainder f_r is smaller than the indexed value 92684335; thus it should be
stored in the canonical bucket, and all other remainders of the run have
to be shifted and marked as a continuation. In this case, the shifting also
affects the remainders from other runs, forcing us to shift them as well,
set their is_shifted bits, and carry over the continuation bits if they
were set for their current positions.
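
bucket  is_occupied  is_continuation  is_shifted  remainder
0       0            0                0           -
1       1            0                0           30667272
2       1            1                1           92684335
3       0            1                1           239072488
4       1            0                1           400901718
5       0            0                1           525765208
6       0            0                0           -
7       1            0                0           490127823

Runs: buckets 1-3, bucket 4, bucket 5, bucket 7. Clusters: buckets 1-5,
bucket 7.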


Testing for elements can be completed in the same way as insertion. 
We check if the canonical bucket for the tested element has at least one 
associated remainder somewhere in the filter by observing 
the is_occupied bit. If the bit is not set, we can conclude that element 
is definitely not in the filter. Otherwise, we scan the filter using 
the scan procedure given by Algorithm 2.9 to find the appropriate run
for the bucket. Next, within that run, we compare stored remainders 
with the remainder of the tested element taking into account that they 
are all sorted. If such a remainder is found, we can report that 
the element may exist in the filter. 

Algorithm 2.11: Testing element in the Quotient filter
Input: Element x ∈ D
Input: Quotient filter with hash function h
Output: False if element not found and True if element may exist
f ← h(x)
f_q, f_r ← f
if QF[f_q].is_occupied ≠ 1 then
    return False
else
    r_start, r_end ← Scan(QF, f_q)
    /* search for f_r within the run */
    for i ← r_start to r_end do
        if QF[i] = f_r then
            return True
    return False


Example 2.10: Test elements in the filter 

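Consider the QuotientFilter data structure that we built in
Example 2.9:

bucket  is_occupied  is_continuation  is_shifted  remainder
0       0            0                0           -
1       1            0                0           30667272
2       1            1                1           92684335
3       0            1                1           239072488
4       1            0                1           400901718
5       0            0                1           525765208
6       0            0                0           -
7       1            0                0           490127823

Runs: buckets 1-3, bucket 4, bucket 5, bucket 7. Clusters: buckets 1-5,
bucket 7.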


Let’s test the element Paris, with quotient f_q = 4 and remainder f_r =
525765208 as calculated earlier. The is_occupied bit of bucket 4 is set,
meaning there is at least one remainder somewhere in the filter that has it
as the canonical bucket. We cannot at this point compare the value from
the bucket with f_r, because the is_shifted bit is set and we need to
find the run that corresponds to canonical bucket 4 in the current cluster.

Thus, we scan from bucket 4 to the left and count buckets with set
is_occupied bits until we reach the start of the cluster. In our example,
the cluster starts at bucket 1, and there are two occupied buckets
(buckets 1 and 2) located left of bucket 4. Therefore, our run is the third
in the cluster, and we need to scan right from the beginning of the cluster
(bucket 1) until we reach that run by counting buckets with unset
is_continuation bits. Finally, we find that the run starts at bucket 5, and
start comparing stored remainders, taking into account that they are sorted
in ascending order.

The value in bucket 5 exactly matches the remainder f_r = 525765208, thus
we can conclude that element Paris may exist in the filter.
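
The lookup walk described in this example can be sketched in Python. This is a minimal illustration, not the book's Algorithm 2.9: the bucket state is hard-coded from the narrative of Example 2.9 (the bit values are reconstructed from that description), the function name `may_contain` is hypothetical, and the sketch assumes the filter is not completely full.

```python
# Final state of Example 2.9, one tuple per bucket:
# (is_occupied, is_continuation, is_shifted, remainder)
QF = [
    (0, 0, 0, None),
    (1, 0, 0, 30667272),    # Warsaw
    (1, 1, 1, 92684335),    # Lisbon
    (0, 1, 1, 239072488),   # Stockholm
    (1, 0, 1, 400901718),   # Zagreb
    (0, 0, 1, 525765208),   # Paris
    (0, 0, 0, None),
    (1, 0, 0, 490127823),   # Copenhagen
]


def may_contain(qf, f_q, f_r):
    m = len(qf)
    if not qf[f_q][0]:               # is_occupied unset: definitely absent
        return False
    b = f_q
    while qf[b][2]:                  # walk left to the start of the cluster
        b = (b - 1) % m
    s = b                            # s tracks the start of the run for quotient b
    while b != f_q:
        s = (s + 1) % m              # skip the rest of the current run
        while qf[s][1]:
            s = (s + 1) % m
        b = (b + 1) % m              # move to the next occupied quotient
        while not qf[b][0]:
            b = (b + 1) % m
    while True:                      # compare remainders within the run
        if qf[s][3] == f_r:
            return True
        s = (s + 1) % m
        if not qf[s][1]:
            return False


print(may_contain(QF, 4, 525765208))   # Paris
```

For Paris the walk goes left from bucket 4 to bucket 1, skips the runs of quotients 1 and 2, and lands on bucket 5, as in the text.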


Deletions in a Quotient filter are handled in a very similar way to
the addition of a new element. However, since all remainders of
fingerprints with the same quotient are stored contiguously according to
their numerical order, removal of a remainder from the cluster must
shift the subsequent remainders to fill the “empty” entry after deletion
and modify the metadata bits accordingly.


Algorithm 2.12: Using left shift to fill empty buckets
Input: Bucket index k
Input: Quotient filter of length m
i ← k + 1
while QF[i] ≠ NULL do
    QF[i - 1] ← QF[i]
    QF[i - 1].is_continuation ← QF[i].is_continuation
    QF[i - 1].is_shifted ← QF[i].is_shifted
    QF[i] ← NULL
    QF[i].is_continuation ← 0
    QF[i].is_shifted ← 0
    i ← i + 1
    if i > m then
        i ← 0


First, it is necessary to check whether the canonical bucket is occupied;
if it is not, the element is definitely not in the filter and we can
stop here. Afterward, we use the scan procedure to find the appropriate
bucket, delete the requested element (if it exists), shift
the subsequent elements, and update the corresponding metadata bits.
Note that if the deleted remainder was the last one for its canonical
bucket, we also unset the is_occupied bit.

Algorithm 2.13: Deleting element from the Quotient filter
Input: Element x ∈ D
Input: Quotient filter with hash function h
Output: False if element not found and True otherwise
f ← h(x)
f_q, f_r ← f
if QF[f_q].is_occupied ≠ 1 then
    /* no remainder has f_q as its canonical bucket */
    return False
r_start, r_end ← Scan(QF, f_q)
for i ← r_start to r_end do
    if QF[i] = f_r then
        /* element found and can be deleted */
        QF[i] ← NULL
        if r_start = r_end then
            QF[f_q].is_occupied ← 0
        else if i < r_end then
            QF ← ShiftLeft(QF, i + 1)
        return True
return False


Properties 

False positives are possible. The Quotient filter data structure is
a compact representation of a multi-set of fingerprints, and its false
positive rate is a function of the hash function h and the number of
elements n added into the filter.

Moreover, two different elements could have the same values for both
the remainder and quotient, which is called a hard collision. Due to such
extremely rare events, false positive responses can occur, and their
probability P_fp is upper bounded by

$$P_{fp} \le 1 - e^{-n / 2^{p}}. \tag{2.6}$$

Formula (2.6) shows that, for a fixed number of expected elements n,
there is a trade-off between the probability of false positives P_fp and
the length of the fingerprint p.

Practical implementations of Quotient filters use 32- and 64-bit fingerprints. 

Similar to other hash tables, the load factor is very important in
the Quotient filter and we want to allocate at least as many buckets as
we expect elements, meaning we choose the number of buckets m as

$$m := 2^{q} \ge n, \tag{2.7}$$

and the length of the remainder r can be calculated from (2.6) as

$$r = \left\lceil \log_2\left( -\frac{n}{2^{q} \cdot \ln(1 - P_{fp})} \right) \right\rceil. \tag{2.8}$$

False negatives are not possible. As with the other data
structures in this chapter, if the Quotient filter finds that an element is
not a member, then it is definitely not a member of the set:

$$P_{fn} = 0. \tag{2.9}$$

The Quotient filter is about 20% bigger than the Bloom filter, but
faster because each access requires evaluating only a single hash function
and all data are stored in contiguous blocks. Tests in a Quotient filter
incur a single cache miss, as opposed to at least two in expectation for
the Bloom filter algorithm.

Results of in-RAM performance comparisons in [Be12] show that Quotient
filters can handle 2.4 million inserts per second while Bloom filters are
limited to about 0.69 million. However, for element tests, both are at
almost the same level of about 2 million per second.


Example 2.11: Required memory 

As stated in (2.7), to handle 1 billion elements, the Quotient filter has to
contain at least 2^30 buckets, meaning we cannot use fingerprints shorter
than 31 bits. If we want to keep the probability of false positive events
at about 2%, the number of bits for the remainder can be found from
formula (2.8) as

$$r = \left\lceil \log_2\left( -\frac{10^{9}}{2^{30}} \cdot \frac{1}{\ln(1 - 0.02)} \right) \right\rceil = 6.$$

Therefore, the required length of the fingerprint is p = q + r = 36 bits,
where the first 30 bits are used for the bucketing and the remaining 6 bits
are stored in the appropriate bucket. Since every bucket additionally
contains three metadata bits, the total size of the Quotient filter is
9 · 2^30 bits, which is roughly 1.2 GB of memory.
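
The sizing arithmetic of this example can be reproduced with a few lines of Python. This is a sketch under the assumption that (2.8) has the form r = ⌈log2(-n / (2^q · ln(1 - P_fp)))⌉, which yields the r = 6 quoted above; the function name `quotient_filter_size` is illustrative.

```python
import math


def quotient_filter_size(n: int, p_fp: float) -> tuple[int, int, int]:
    """Return (q, r, total_bits) for n elements and target false positive rate p_fp."""
    q = math.ceil(math.log2(n))                     # smallest q with 2^q >= n, as in (2.7)
    r = math.ceil(math.log2(-n / (2**q * math.log(1 - p_fp))))
    total_bits = (r + 3) * 2**q                     # remainder plus three metadata bits per bucket
    return q, r, total_bits


q, r, bits = quotient_filter_size(10**9, 0.02)
print(q, r, bits / 8 / 1e9)   # size printed in decimal gigabytes
```

For n = 10^9 and P_fp = 0.02 this gives q = 30, r = 6 and 9 · 2^30 bits.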


The Quotient filter can restore fingerprints from the stored data and
therefore supports deletion, merging, and resizing. Merging doesn’t
affect the false positive rates of the filters, and deletion in a Quotient
filter is always correct, in contrast to the Counting Bloom filter, which
supports only probabilistically correct deletions.

Resizing of the Quotient filter (both shrinking and expanding) can be
done by iterating over the filter and copying each fingerprint into
a newly allocated data structure without the need to re-hash. Two or
more Quotient filters can be merged using an algorithm similar to
merge sort, the divide-and-conquer sorting algorithm invented by
John von Neumann. Thus, all input filters can be scanned in parallel
and the merged result is written to the output filter.

The time required to perform a test, addition or deletion in a Quotient
filter is dominated by the time to scan backward and forward.

However, the Quotient filter is designed with a focus on big data (e.g.,
1 billion elements for a 64-bit hash function), and for small- or
medium-sized datasets its complexity can outweigh the benefits.


2.4 Cuckoo filter 

Most modifications of the classical Bloom filter that support deletions
degrade either in space or performance. To address this problem,
Bin Fan, David Andersen, Michael Kaminsky, and Michael Mitzenmacher
proposed the Cuckoo filter in 2014 [Fa14]. It is a compact variant
of the Cuckoo hash table that we discussed earlier, but adjusted to store
only fingerprints of some length p for each inserted element, instead of
key-value pairs.

Cuckoo filters are easier to implement and support dynamic
additions and deletions, while using less space and achieving even higher
performance than other Bloom filter modifications in many practical
applications.

The CuckooFilter data structure is represented as a multi-way
associative Cuckoo hash table with m buckets, each of which can store
up to b values. With standard Cuckoo hashing, inserting
a new element requires access to the original existing elements in
order to determine where to relocate stored values if space is needed for
new ones. However, the Cuckoo filter stores only fingerprints, and there
is no way to restore the original elements and re-hash them to find their
new bucket in the hash table.

To overcome this limitation and still employ
Cuckoo hashing, the Cuckoo filter algorithm uses Partial-Key
Cuckoo hashing, which allows the new bucket of an existing element to
be derived from its fingerprint without knowing the original element itself.
According to that schema, for each element x to be inserted,
the algorithm computes a p-bit fingerprint f and the indexes of two
candidate buckets as follows:

$$i = h(x) \bmod m, \qquad j = (i \oplus (h(f) \bmod m)) \bmod m. \tag{2.10}$$

In order to distribute elements uniformly in the hash table,
the fingerprint f is additionally hashed with a hash function h before
the XOR calculation in formula (2.10).


When the fingerprint length p is small compared to the filter length m,
the XOR operation alters only a small number of lower bits, while most
of the higher-order bits stay the same. This implies that elements
shifted from their primary buckets tend to be found close to each other in
their alternate buckets, and the distribution in the hash table becomes
skewed, which reduces the efficiency of the filter.

Hashing the fingerprints ensures that these elements are relocated to buckets
in entirely different parts of the hash table, hence reducing hash collisions
and improving the table utilization.

The exclusive disjunction (XOR) operation ⊕ in formula (2.10)
ensures an important property: by knowing the current element’s
bucket k it is possible to compute its alternate bucket k* without
restoring the original element:

$$k^{*} = (k \oplus h(f)) \bmod m. \tag{2.11}$$
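
The property behind (2.11) can be checked with a small Python sketch. It makes two assumptions that are not in the text: SHA-256 from the standard library stands in for MurmurHash3, and m is a power of two, in which case applying the formula twice always returns the original bucket. The names `h` and `alternate_bucket` are illustrative.

```python
import hashlib


def h(value: int) -> int:
    """Stand-in 32-bit hash (assumption: the text uses MurmurHash3)."""
    digest = hashlib.sha256(value.to_bytes(8, "big")).digest()
    return int.from_bytes(digest[:4], "big")


def alternate_bucket(k: int, f: int, m: int) -> int:
    """Formula (2.11): derive the other candidate bucket from k and the fingerprint f."""
    return (k ^ h(f)) % m


m = 16          # power of two, so XOR followed by mod m is its own inverse
f = 49615       # a 16-bit fingerprint
round_trips = [alternate_bucket(alternate_bucket(k, f, m), f, m) == k
               for k in range(m)]
print(all(round_trips))
```

Only the bucket index and the stored fingerprint are needed, which is exactly what relocation has available.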

To add a new element x into the Cuckoo filter, we compute indices 
for two candidate buckets with (2.10). If at least one of those buckets is 
empty, we insert the element into that bucket. Otherwise, we randomly 
choose one of those buckets and store element x there, while moving the 
element from that bucket to its alternative candidate bucket using (2.11). 
We repeat this procedure until an empty bucket is found, or until a 
maximum number of displacements is reached. If there are no empty 
buckets, the filter is considered full. 

Algorithm 2.14: Adding element to the Cuckoo filter
Input: Element x ∈ D
Input: Cuckoo filter with fingerprinting and hash function h
Output: True if element has been added and False otherwise
f ← fingerprint(x)
i ← h(x)
j ← i ⊕ h(f)
if CuckooFilter[i] has empty space then
    CuckooFilter[i].add(f)
    return True
else if CuckooFilter[j] has empty space then
    CuckooFilter[j].add(f)
    return True
k ← sample({i, j})
for n ← 0 to MaxIter do
    x ← sample(CuckooFilter[k])
    swap f and the fingerprint stored in entry x
    k ← k ⊕ h(f)
    if CuckooFilter[k] has empty space then
        CuckooFilter[k].add(f)
        return True
return False


Example 2.12: Add elements to the filter 

Consider a CuckooFilter data structure of length m = 10 that, for
simplicity, stores only one p = 16-bit fingerprint per bucket. We use
a single 32-bit MurmurHash3 hash function to compute the fingerprints
and the bucket indices.

Similar to other examples, we insert the names of capital cities, starting 
with element Copenhagen, whose p-bit fingerprint is 


$$f = \mathrm{MurmurHash3}(\text{Copenhagen}) \bmod 2^{p} = 49615.$$

Its primary bucket i, according to formula (2.10), is

$$i = \mathrm{MurmurHash3}(\text{Copenhagen}) \bmod m = 7,$$

and the alternate bucket j can be derived from i and the fingerprint f as
follows:

$$j = (i \oplus \mathrm{MurmurHash3}(f)) \bmod m = (7 \oplus 34475545) \bmod 10 = 0.$$

Thus, we can index the fingerprint f into either bucket 7 or 0, and, since
the filter is empty, we use the primary bucket:

bucket       0      1      2      3      4      5      6      7      8      9
fingerprint  -      -      -      -      -      -      -      49615  -      -


Similarly, we index the element Athens with fingerprint f = 27356 and
the candidate buckets 0 and 7. The primary bucket 0 isn’t occupied, so
the fingerprint can be stored there directly:

bucket       0      1      2      3      4      5      6      7      8      9
fingerprint  27356  -      -      -      -      -      -      49615  -      -


Consider the element Lisbon, whose fingerprint is f = 16431 and whose
candidate buckets are 7 and 9. We start with the primary bucket 7, but it
is already occupied and at its maximum capacity of one, hence we check
the alternate bucket 9, which is empty, and we store the fingerprint there:

bucket       0      1      2      3      4      5      6      7      8      9
fingerprint  27356  -      -      -      -      -      -      49615  -      16431


Next, consider the element Helsinki. It has fingerprint f = 15377 and
both bucket indices equal to 7. Note that such an index collision is more
likely for small filters, as we have in this example, than for real ones.
Bucket 7 is occupied and cannot accept more than one element, therefore
we need to start the relocation procedure in the filter. We start with
bucket k = 7 and swap the value 49615 stored there with the value f, then
relocate the evicted value 49615 to a new bucket k that is derived by
formula (2.11):

$$k = (7 \oplus \mathrm{MurmurHash3}(49615)) \bmod 10 = 0.$$

bucket       0      1      2      3      4      5      6      7      8      9
fingerprint  27356  -      -      -      -      -      -      15377  -      16431

The evicted value 49615 is now to be placed into bucket 0.


Unfortunately, bucket 0 already contains the value 27356, so we swap it with
49615 and need to compute the new bucket index for the evicted value 27356:

$$k = (0 \oplus \mathrm{MurmurHash3}(27356)) \bmod 10 = 7.$$

We are back to bucket 7, which isn’t empty, and we have to repeat
the relocation procedure once again. First, we store the value 27356 in
the bucket, and then compute a new bucket for the evicted value 15377:

$$k = (7 \oplus \mathrm{MurmurHash3}(15377)) \bmod 10 = 7.$$

Due to the index collision that we mentioned earlier for that fingerprint,
we return to bucket 7 again and store the value 15377 in it while relocating
the value 27356 to a new bucket k:

$$k = (7 \oplus \mathrm{MurmurHash3}(27356)) \bmod 10 = 2.$$

Fortunately, bucket 2 is empty, so we can store the value 27356 and finish
the insertion procedure.


The testing of element existence in the filter is straightforward. First,
for the tested element, we compute the fingerprint and its candidate
buckets. If the fingerprint is present in either bucket, we conclude that
the element may exist. Otherwise, it is definitely not in the filter.


Algorithm 2.15: Testing element in the Cuckoo filter
Input: Element x ∈ D
Input: Cuckoo filter with fingerprinting and hash function h
Output: False if element not found and True if element may exist
f ← fingerprint(x)
i ← h(x)
j ← i ⊕ h(f)
if f ∈ CuckooFilter[i] or f ∈ CuckooFilter[j] then
    return True
return False


Example 2.13: Test elements in the filter 

Consider the CuckooFilter data structure that we built in Example 2.12:

bucket       0      1      2      3      4      5      6      7      8      9
fingerprint  49615  -      27356  -      -      -      -      15377  -      16431


Let’s test the element Lisbon, whose candidate buckets are 7 and 9 and whose
fingerprint is f = 16431, as we computed above. We can find the value 16431
in bucket 9 and conclude that the element Lisbon may exist in the filter.

In contrast, consider the element Oslo, which has fingerprint f = 53104
and candidate buckets 0 and 6. As we can see, there is no such value in
these buckets, therefore the element Oslo is definitely not in the filter.


In order to delete an element, we build the fingerprint, then compute
the indices of the candidate buckets and look up the fingerprint
there. If it matches any existing value in either bucket, one copy of
the fingerprint is removed from that bucket.

Algorithm 2.16: Deleting element from the Cuckoo filter
Input: Element x ∈ D
Input: Cuckoo filter with fingerprinting and hash function h
Output: True if element has been deleted and False otherwise
f ← fingerprint(x)
i ← h(x)
j ← i ⊕ h(f)
if f ∈ CuckooFilter[i] then
    CuckooFilter[i].drop(f)
    return True
else if f ∈ CuckooFilter[j] then
    CuckooFilter[j].drop(f)
    return True
return False
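
The add, test and delete procedures of this section can be combined into one compact Python sketch. It is an illustration under stated assumptions rather than a reference implementation: SHA-256 stands in for MurmurHash3, the fingerprint is computed from a salted hash so that it is not simply the low bits of the bucket index, m is required to be a power of two so that relocation by XOR is reversible, and the class and method names are hypothetical.

```python
import hashlib
import random


def _hash32(data: bytes) -> int:
    # Stand-in for MurmurHash3 (assumption): first 32 bits of SHA-256.
    return int.from_bytes(hashlib.sha256(data).digest()[:4], "big")


class CuckooFilter:
    def __init__(self, m=1024, b=4, p=16, max_iter=500, seed=0):
        if m < 1 or m & (m - 1):
            raise ValueError("this sketch assumes m is a power of two")
        self.m, self.b, self.p, self.max_iter = m, b, p, max_iter
        self.buckets = [[] for _ in range(m)]
        self.rng = random.Random(seed)

    def _alternate(self, k: int, f: int) -> int:
        return (k ^ _hash32(f.to_bytes(4, "big"))) % self.m

    def _candidates(self, x: str):
        f = _hash32(b"fp:" + x.encode()) % (1 << self.p)   # p-bit fingerprint
        i = _hash32(x.encode()) % self.m                   # primary bucket
        return f, i, self._alternate(i, f)

    def add(self, x: str) -> bool:
        f, i, j = self._candidates(x)
        for k in (i, j):
            if len(self.buckets[k]) < self.b:
                self.buckets[k].append(f)
                return True
        k = self.rng.choice((i, j))
        for _ in range(self.max_iter):
            slot = self.rng.randrange(self.b)              # bucket k is full here
            f, self.buckets[k][slot] = self.buckets[k][slot], f
            k = self._alternate(k, f)
            if len(self.buckets[k]) < self.b:
                self.buckets[k].append(f)
                return True
        return False   # considered full; the last evicted fingerprint is dropped

    def may_contain(self, x: str) -> bool:
        f, i, j = self._candidates(x)
        return f in self.buckets[i] or f in self.buckets[j]

    def delete(self, x: str) -> bool:
        f, i, j = self._candidates(x)
        for k in (i, j):
            if f in self.buckets[k]:
                self.buckets[k].remove(f)                  # removes one copy only
                return True
        return False


cf = CuckooFilter()
for city in ["Copenhagen", "Athens", "Lisbon", "Helsinki"]:
    cf.add(city)
print(cf.may_contain("Lisbon"))   # True
```

Storing buckets as short Python lists keeps the sketch readable; a real filter would pack the fingerprints into a fixed-size array.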


Properties 

When the Cuckoo filter needs to support deletion, it must store multiple
copies of the same value or maintain counters for each stored value.
However, both approaches provide only probabilistically correct deletions:
the first because of the limited bucket capacity (we cannot store more than
2b identical values in the table), and the second because the counters can
overflow, as was explained for the Counting Bloom filter. A non-deletable
Cuckoo filter doesn’t have this problem and is much more space efficient,
because it doesn’t need to remember identical values that were added
multiple times.

False positives are possible. Different elements
could share the same fingerprint, but in most cases they have different
candidate buckets and thus can still be differentiated. However, when
the candidate buckets are also the same for those elements, a hard
collision occurs. Due to such extremely rare events, the filter can produce
false positive responses, and their probability P_fp is

$$P_{fp} = 1 - \left(1 - \frac{1}{2^{p}}\right)^{2b} \approx \frac{2b}{2^{p}}. \tag{2.12}$$

Formula (2.12) shows that, for a fixed number of expected
elements n, there is a trade-off between the probability of false positives
P_fp and the bucket size b, which can be compensated by the length of
the fingerprints p. Intuitively, if the fingerprints are sufficiently long,
Partial-Key Cuckoo hashing is a good approximation to standard Cuckoo
hashing, but longer fingerprints increase the required space.

Thus, the recommended fingerprint length p can be estimated as

$$p \ge \left\lceil \log_2 \frac{2b}{P_{fp}} \right\rceil, \tag{2.13}$$

and if we want to store in m buckets of size b at least as many values as
the number of input elements, the length of the filter is lower bounded
by

$$m \ge \left\lceil \frac{n}{b} \right\rceil. \tag{2.14}$$

False negatives are not possible. As with the other filters, if the Cuckoo
filter finds that an element is not a member, then it is definitely not
a member of the set:

$$P_{fn} = 0. \tag{2.15}$$

Cuckoo filters ensure high space occupancy because they refine earlier
element-placement decisions when adding new elements. However, they
have a maximum capacity, which is expressed as a load factor α. After
reaching the maximum feasible load factor, insertions are non-trivially
and increasingly likely to fail, hence the hash table must expand to store
more elements.

The average number of bits per element β can be defined as the ratio
between the length of the fingerprints and the load factor α that, under
a fixed false positive probability P_fp, can be estimated as

$$\beta \le \frac{1}{\alpha} \left\lceil \log_2 \frac{2b}{P_{fp}} \right\rceil. \tag{2.16}$$

Since the Cuckoo hashing schema uses two hash functions, the load factor
with buckets of size b = 1 is 50%, as the hash table is directly mapped.
However, increasing the bucket size improves table occupancy; for instance,
for b = 2 and b = 4 the load factors are 84% and 95%, respectively.

The experimental study [Fa14] showed that buckets of size b ∈ {1, 2, 3, 4}
are enough for practically important cases.


Example 2.14: Required memory 

For instance, we want to handle 1 billion elements with a Cuckoo filter,
keeping the probability of false positive events at about 2% and table
occupancy at 84%. To support such a load factor we choose the size of
buckets as b = 2, meaning the length of the filter is m = 2^29, according
to (2.14).

As stated in (2.13), the minimal fingerprint length is

$$p = \left\lceil \log_2 \frac{2 \cdot 2}{0.02} \right\rceil = 8.$$

Therefore, the required length of the fingerprint is 8 bits and the total size
of the Cuckoo filter is 2 · 8 · 2^29 bits, which is roughly 1.07 GB of memory.

To compare the space requirements with other studied filters, we can use
b = 1, which achieves 50% table occupancy and requires a filter of length
m = 2^30. According to (2.13), we need to use 7-bit fingerprints, which
results in about 0.94 GB of memory.
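
The numbers in this example can be reproduced with a short Python sketch. It assumes that the fingerprint length is the smallest integer p with 2b / 2^p ≤ P_fp and that the number of buckets is rounded up to the next power of two above n / b, which is what the example's m = 2^29 and m = 2^30 suggest; the function name `cuckoo_filter_size` is illustrative.

```python
import math


def cuckoo_filter_size(n: int, p_fp: float, b: int) -> tuple[int, int, int]:
    """Return (p, m, total_bits) for n elements, target rate p_fp and bucket size b."""
    p = math.ceil(math.log2(2 * b / p_fp))     # fingerprint length in bits
    m = 2 ** math.ceil(math.log2(n / b))       # buckets, rounded up to a power of two
    return p, m, b * p * m


for b in (2, 1):
    p, m, bits = cuckoo_filter_size(10**9, 0.02, b)
    print(b, p, m, round(bits / 8 / 1e9, 2))   # size in decimal gigabytes
```

For b = 2 this gives p = 8 and 2 · 8 · 2^29 bits (about 1.07 GB); for b = 1 it gives p = 7 and 7 · 2^30 bits (about 0.94 GB).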


In fact, Cuckoo filters use an approach similar to the d-left Counting
Bloom filter [Bo06], but they achieve better space efficiency and are much
more straightforward to implement. For applications that store many
elements and target moderately low false positive rates (less than 3%),
Cuckoo filters offer lower space overhead than even space-optimized
Bloom filters.

However, when the Cuckoo filter is at its maximum capacity,
the underlying hash table must be extended; until then it is not possible
to add new elements. In contrast, with the Bloom filter it is still
possible to keep inserting new elements at the cost of an increased false
positive rate.


Conclusion 

This chapter was dedicated to membership problems, and we learned how
traditional hash tables can be replaced or extended to be practically
applicable for big data handling. We studied the most well-known
probabilistic data structure, the Bloom filter, discussed its
strengths and weaknesses, and then considered its modifications that are
widely used in practice. Additionally, we studied the modern alternatives
that have better data locality, support more operations, and are tuned
for good performance on large datasets.

If you are interested in more information about the material covered
here or want to read the original papers, please take a look at the list of
references that follows this chapter.

In the next chapter we study the problem of determining the number
of distinct elements in a dataset, which can be challenging for big data
and also requires probabilistic approaches to be solved efficiently.


3 Cardinality


The cardinality estimation problem is the task of finding the number of
distinct elements in a dataset where duplicates are present. Traditionally,
to determine the exact cardinality of a set, classical methods build a list
of all elements, and use sorting and search to avoid listing elements
multiple times. Counting the number of elements in that list gives
the exact number of unique elements, but it has a time complexity
of O(N · log N), where N is the number of all elements including duplicates,
and requires auxiliary linear memory, which is unlikely to be feasible for
Big Data applications that operate on huge datasets of large cardinalities.
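
As a point of reference, the classical exact approach described above can be sketched in Python. This is a minimal illustration of the sort-and-scan idea, not a method prescribed by the text; the function name `exact_cardinality` is hypothetical.

```python
def exact_cardinality(items) -> int:
    """Count distinct elements by sorting and skipping adjacent duplicates."""
    ordered = sorted(items)        # O(N log N) time and O(N) extra memory
    distinct = 0
    previous = object()            # sentinel that equals no real element
    for item in ordered:
        if item != previous:
            distinct += 1
            previous = item
    return distinct


print(exact_cardinality(["10.0.0.1", "10.0.0.2", "10.0.0.1", "10.0.0.3"]))   # 3
```

The sorted copy holds every element, duplicates included, which is the linear memory cost that becomes prohibitive at scale.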


Example 3.1: Unique visitors

One of the valuable KPIs for any website is the number of unique visitors
that have visited it over a specified period of time. For simplicity, we
assume that unique visitors use different IP addresses, therefore we need
to calculate the number of unique IPs, which in the IPv6 Internet
Protocol version are represented by 128-bit strings. Is this an easy task?
Can we just use the classical methods to count the number exactly? That
depends on the popularity of the website.

Consider traffic statistics for March 2017 of the top three most popular
retail websites in the United States: amazon.com, ebay.com and
walmart.com. According to SimilarWeb¹, the average number of visits to
those websites was about 1.44 billion and the average number of pages
viewed per visit was 8.24. Therefore, the statistics for March 2017 include
about 12 billion IP addresses at 128 bits each, meaning a total size of
192 GB.

If we assume that every 10th of those visitors was unique, we can expect
the cardinality of such a set to be about 144 million, and the memory
required to store the list of unique elements is about 2.3 GB.


Another example illustrates the challenge of cardinality estimation in
scientific research.


Example 3.2: DNA analysis (Giroire, 2006) 

One of the long-standing tasks in human genome research is to study
correlations in DNA sequences. DNA molecules include two paired strands,
each made up of four chemical DNA-base units, marked A (adenine),
G (guanine), C (cytosine), and T (thymine). The human genome contains
about 3 billion such base pairs. Sequencing means determining the exact
order of the base pairs in a segment of DNA.

From a mathematical point of view, a DNA sequence can be considered
a string of symbols A, G, C, T which can be as long as you want, and we
can consider them as an example of a potentially infinite dataset.

The correlation measuring problem can be formulated as the task of
determining the number of distinct substrings of some fixed size in a piece
of DNA. The idea is that a sequence with a few distinct substrings is
more correlated than a sequence of the same size but with more distinct
substrings.

Such experiments demand multiple runs on many huge files, and to speed
up the research they require only limited or even constant memory and
small execution time, which is infeasible with exact counting algorithms.

¹ Traffic Overview https://www.similarweb.com/website/amazon.com?competitors=ebay.com


Thus, the possible gains of exact cardinality computation are
outweighed by its large processing time and memory requirements. Big Data
applications should use more practical approaches, mostly based on various
probabilistic algorithms, even if they can provide only approximate
answers.


While processing data, it is important to understand the size of the dataset
and the number of possible distinct elements.

Consider the potentially infinite sequence of 1-letter strings o, d , s, ..., 
which is based on letters from the English alphabet. The cardinality can be 
easily estimated and it is upper bounded by the number of letters, which 
is 26 in the modern English language. Obviously, in this case, there is 
no need to apply any probabilistic approach and a naive dictionary-based 
solution of exact cardinality calculation works very well. 

To approach the cardinality problem, many of the popular probabilistic
methods are influenced by the ideas of the Bloom filter algorithm: they
operate on hash values of elements, observe common patterns in their
distribution, and make reasoned “guesses” about the number of unique
elements without the need to store all of them.


3.1 Linear Counting 

As a first probabilistic approach to the cardinality problem, we consider
the linear-time probabilistic counting algorithm, the Linear Counting
algorithm. The original ideas were proposed by Morton Astrahan, Mario
Schkolnick, and Kyu-Young Whang in 1987 [As87] and the practical
algorithm was published by Kyu-Young Whang, Brad Vander-Zanden,
and Howard Taylor in 1990 [Wh90].

The immediate improvement to the classical exact methods was to
hash elements with some hash function h, which out-of-the-box can
eliminate duplicates without the need to sort elements, at the cost of
introducing some probability of error due to possible hash collisions (we
cannot distinguish duplicates from “accidental duplicates”). Thus, using
such a hash table, only a proper scan procedure is required to implement
a simple algorithm that already outperforms the classical approach.

However, for datasets with huge cardinalities, such hash tables could
be quite large and require memory that grows linearly with the number
of distinct elements in the set. For systems with limited memory, it
will require disk or distributed storage at some point, which drastically
reduces the benefits of hash tables due to slow disk or network access.

Similar to the Bloom filter idea, to work around such an issue,
the Linear Counting algorithm doesn’t store the hash values themselves,
but instead their corresponding bits, replacing the hash table with a bit
array LinearCounter of length m. It is assumed that m is still
proportional to the expected number of distinct elements n, but requires
only 1 bit per element, which is feasible for most cases.

In the beginning, all bits in LinearCounter are equal to zero. To 
add a new element x into such a data structure, we compute its hash 
value h(x) and set the corresponding bit to one in the counter. 

Algorithm 3.1: Adding element to the Linear counter
Input: Element x ∈ D
Input: Linear counter with hash function h
j ← h(x)
if LinearCounter[j] = 0 then
    LinearCounter[j] ← 1

Since only one hash function h is used, we can expect many additional
hard collisions when two different hash values set the same bit in the array.
Thus, the exact (or even near-exact) number of distinct elements can no
longer be directly obtained from such a sketch.
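
The update step of Algorithm 3.1 can be sketched in Python as follows. This is only an illustration under stated assumptions: the class name, the `_hash` helper and the use of SHA-1 from the standard library (as a stand-in for whatever hash function h is chosen) are not part of the original algorithm description.

```python
import hashlib


class LinearCounter:
    """Bit array of length m updated with a single hash function."""

    def __init__(self, m):
        self.m = m
        self.bits = [0] * m  # in the beginning, all bits are zero

    def _hash(self, element):
        # Assumption: SHA-1 reduced modulo m plays the role of h(x).
        digest = hashlib.sha1(str(element).encode("utf-8")).digest()
        return int.from_bytes(digest[:4], "big") % self.m

    def add(self, element):
        j = self._hash(element)
        if self.bits[j] == 0:
            self.bits[j] = 1


counter = LinearCounter(16)
for city in ["Berlin", "Paris", "Berlin"]:
    counter.add(city)
print(sum(counter.bits))  # number of set bits, at most 2 here
```

Adding the same element twice never changes the array, and two different elements may still hit the same bit, which is the collision effect described above.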

The algorithm thus distributes elements into buckets
(bits indexed by hash values) and keeps a LinearCounter bit array
indicating which buckets are hit. Observing the number of hits in
the array leads to the estimate of the cardinality.

In the first step of the Linear Counting algorithm, we build our
LinearCounter data structure as is shown in Algorithm 3.1. Having
such a sketch, the cardinality can be estimated using the observed
fraction of empty bits V by the formula:

$$ n \approx -m \cdot \ln V. \qquad (3.1) $$

We now see clearly how collisions affect the cardinality estimation
in the Linear Counting algorithm: each collision reduces the number
of bits that have to be set, making the observed fraction of unset bits
bigger than the real value. If there were no hash collisions, the final
count of set bits would be the desired cardinality. However, collisions
are unavoidable and the formula (3.1) actually gives an overestimation
of the exact cardinality and, since the cardinality is an integer value, we
prefer to round its result to the nearest smaller integer.
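
The estimation step with formula (3.1) and the rounding down can be sketched as below. The function name and the 8-bit sample array are hypothetical, and raising an error for a completely full array is one possible design choice, not something prescribed by the algorithm.

```python
import math


def linear_counting_estimate(bits):
    """Estimate cardinality from a Linear Counting bit array."""
    m = len(bits)
    zeros = bits.count(0)  # number of empty bits
    if zeros == 0:
        # V = 0 makes ln V undefined: the array is full.
        raise ValueError("bit array is full; a larger m is needed")
    v = zeros / m          # observed fraction of empty bits
    return math.floor(-m * math.log(v))


# Hypothetical counter of length 8 with 5 empty bits.
bits = [1, 0, 0, 1, 0, 0, 1, 0]
print(linear_counting_estimate(bits))  # floor(-8 * ln(5/8)) = 3
```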

Thus, we can formulate the complete counting algorithm as below. 


Algorithm 3.2: Estimating cardinality with Linear Counting
Input: Dataset D
Output: Cardinality estimation
LinearCounter[i] ← 0, i = 0 ... m − 1
for x ∈ D do
    Add(x, LinearCounter)
Z ← count(LinearCounter[i] = 0), i = 0 ... m − 1
return ⌊−m · ln(Z / m)⌋


Example 3.3: Linear Counting algorithm

Consider a dataset that contains 20 names of capital cities extracted
from recent news articles: Berlin, Berlin, Paris, Berlin, Lisbon, Kiev,
Paris, London, Rome, Athens, Madrid, Vienna, Rome, Rome, Lisbon,
Berlin, Paris, London, Kiev, Washington.

For such small cardinalities (the actual cardinality is 10), to have a standard
error of about 10% we need to choose the length of the LinearCounter
data structure to be at least the expected number of unique elements, thus
let’s choose $m = 2^4$. As the hash function h with values in $\{0, 1, \ldots, 2^4 - 1\}$
we use a function based on 32-bit MurmurHash3 defined as

$$ h(x) := \mathrm{MurmurHash3}(x) \bmod m, $$

and the cities’ hash values can be found in the table at the end of this example.


As we can see, the cities London and Madrid share the same value, but
such collisions are expected and completely natural. The LinearCounter
data structure has the following view:

Position:  0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15
Bit:       0  1  0  0  0  0  1  1  1  0  0  1  1  1  1  1

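According to the Linear Counting algorithm, we calculate the fraction V
of empty bits in the LinearCounter (7 of the 16 bits are still zero):

$$ V = \frac{7}{16} = 0.4375 $$

and the estimated cardinality is

$$ n \approx -16 \cdot \ln 0.4375 \approx 13.227, $$

which overestimates the exact number 10 but, for such a tiny bit array,
is still of the right order.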


City         h(City)
Athens       12
Berlin       7
Kiev         13
Lisbon       15
London       14
Madrid       14
Paris        8
Rome         1
Vienna       6
Washington   11


Properties

If the hash function h can be computed in constant time (which is true
for the most popular hash functions), the time to process every element is
a fixed constant. Thus, the algorithm has O(N) time complexity, where N is
the total number of elements, including duplicates.

As for any other probabilistic algorithm, there are a number of
parameters that can be tuned to influence its performance.

The expected accuracy of the estimation depends on the bit array
size m and on the ratio of the number of distinct elements to it, $\alpha = \frac{n}{m}$, called
the load factor. Whenever $\alpha \geq 1$ (the case m > n is not practically interesting),
there is a non-zero probability $P_{full}$ that the LinearCounter bit
array becomes full, called the fill-up probability, which fatally distorts
the algorithm and blows up the expression (3.1). The probability $P_{full}$
depends on the load factor and, consequently, on the size m, which should
be selected big enough to make the fill-up probability negligible.

The standard error δ is a measure of the variability of the estimate
provided by Linear Counting and there is a trade-off between it and
the bit array size m. Decreasing the standard error results in more
precise estimates, but increases the required memory.

Table 3.1: Trade-off between accuracy and bit array size

                   Bit array size m
n             δ = 1%       δ = 10%
1000          5329         268
10000         7960         1709
100000        26729        12744
1000000       154171       100880
10000000      1096582      831809
100000000     8571013      7061760


The dependency of the accuracy on m is quite complex and there is no analytical
solution for choosing it. However, for a widely acceptable fill-up probability $P_{full} = 0.7\%$
the algorithm authors have provided precomputed values that are given
in Table 3.1 and can be used as references.

Since the fill-up probability is never exactly zero, the bit array can, very
rarely, become full, which breaks Algorithm 3.2. When working with small
datasets, we can re-index all elements with a different hash function or
increase the size of LinearCounter. Unfortunately, such solutions
won’t work for huge datasets and, together with the quite high time
complexity, require a search for alternatives.

However, Linear Counting performs very well when the cardinality 
of the dataset being measured is not extremely big and can be used to 
improve other algorithms, developed to provide the best possible behavior 
for huge cardinalities. 

In the Linear Counting algorithm, the estimation of the cardinality
is approximately proportional to the exact value; this is why the term
“linear” is used. In the next section, we consider an alternative algorithm
that could be classified as “logarithmic” counting since it is based on
estimations that are logarithms of the true cardinality.


3.2 Probabilistic Counting 

One of the counting algorithms that is based on the idea of observing
common patterns in hashed representations of indexed elements is a class
of Probabilistic Counting algorithms invented by Philippe Flajolet and
G. Nigel Martin in 1985 [Fl85].

As usual, every element is pre-processed by applying a hash function h
that transforms elements into integers sufficiently uniformly distributed
over a scalar range $\{0, 1, \ldots, 2^M - 1\}$ or, equivalently, over the set of
binary strings 2 of length M:

$$ h(x) = i = \sum_{k=0}^{M-1} i_k \cdot 2^k := (i_0 i_1 \ldots i_{M-1})_2, \quad i_k \in \{0, 1\}. $$


2 We use the “LSB 0” numbering scheme and start at zero for the least significant bit (LSB).

Flajolet and Martin noticed that the patterns

$$ 0^k 1 := \underbrace{00 \ldots 0}_{k \text{ times}} 1 $$

should appear in such binary strings with probability $2^{-(k+1)}$ and, if
recorded for each indexed element, can play the role of a cardinality
estimator.

Every pattern can be associated with its index, called rank, that is
calculated by the formula:

$$ \mathrm{rank}(i) = \begin{cases} \min\limits_{i_k \neq 0} k, & \text{for } i > 0, \\ M, & \text{for } i = 0 \end{cases} \qquad (3.2) $$

and is simply equivalent to the left-most position of 1, known as the least
significant 1-bit position.

Example 3.4: Rank calculation

Consider an 8-bit long integer number 42 that has the following binary
representation using the “LSB 0” numbering scheme:

$$ 42 = 0 \cdot 2^0 + 1 \cdot 2^1 + 0 \cdot 2^2 + 1 \cdot 2^3 + 0 \cdot 2^4 + 1 \cdot 2^5 + 0 \cdot 2^6 + 0 \cdot 2^7 = (01010100)_2. $$

Thus, the ones appear at positions 1, 3, and 5, therefore, according to
the definition (3.2), the rank(42) is equal to:

$$ \mathrm{rank}(42) = \min(1, 3, 5) = 1. $$
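
The definition (3.2) can be sketched in a few lines of Python. The bit trick `i & -i`, which isolates the lowest set bit of a non-negative integer, is an implementation choice of this sketch and not part of the definition itself.

```python
def rank(i, M=32):
    """Position of the least significant 1-bit of i (LSB 0 numbering).

    Returns M for i == 0, as in definition (3.2).
    """
    if i == 0:
        return M
    lowest_one = i & -i              # keeps only the lowest set bit
    return lowest_one.bit_length() - 1


print(rank(42, M=8))  # 1
print(rank(0, M=8))   # 8
```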


The occurrences of the $0^k 1$ pattern, or simply rank(·) = k, in binary
representations of hash values of each indexed element, can be compactly
stored in a simple data structure Counter, also known as an FM Sketch,
that is represented as a bit array of length M.

At the start, all bits in Counter are equal to zero. When we need to
add a new element x into the data structure, we compute its hash value
using the hash function h, then calculate the rank of that value and set
the corresponding bit to one in the array, as stated in the algorithm
below.

Algorithm 3.3: Adding element to simple counter
Input: Element x ∈ D
Input: Simple counter with hash function h
j ← rank(h(x))
if Counter[j] = 0 then
    Counter[j] ← 1


In this way, the one in the Counter at some position j means that
the pattern $0^j 1$ has been observed at least once amongst the hashed
values of all indexed elements.


Example 3.5: Build a simple counter

Consider the same dataset as in Example 3.3 that contains 20 names
of capital cities extracted from recent news articles: Berlin, Berlin,
Paris, Berlin, Lisbon, Kiev, Paris, London, Rome, Athens, Madrid,
Vienna, Rome, Rome, Lisbon, Berlin, Paris, London, Kiev, Washington.

As the hash function h we can use 32-bit MurmurHash3, which maps elements
to values from $\{0, 1, \ldots, 2^{32} - 1\}$, therefore we can use a simple counter
Counter of length M = 32. Using the hash values already computed in
Example 3.3 and the definition (3.2), we calculate ranks for each element:


City         h(City)      rank
Athens       4161497820   2
Berlin       3680793991   0
Kiev         3491299693   0
Lisbon       629555247    0
London       3450927422   1
Madrid       2970154142   1
Paris        2673248856   3
Rome         50122705     0
Vienna       3271070806   1
Washington   4039747979   0


Thus, the Counter has the following form:

Position:  0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15
Bit:       1  1  1  1  0  0  0  0  0  0  0  0  0  0  0  0

Position: 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31
Bit:       0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0


Let’s stress a very interesting theoretical observation. Based on
the uniform distribution of the values, if n is the exact number of
the distinct elements indexed so far, then we can expect that one in the
first position can appear in about $\frac{n}{2}$ cases, in the second position in
about $\frac{n}{2^2}$ cases, and so on. Thus, if $j \gg \log_2 n$, then the probability of
discovering one in the j-th position is close to zero, hence
the Counter[j] will almost certainly be zero. Similarly, for $j \ll \log_2 n$
the Counter[j] will almost certainly be one. If the value j is around
$\log_2 n$, then the probability to observe one or zero in that position is
about the same.


Thus, the left-most position R of zero in the Counter after inserting
all elements from the dataset can be used as an indicator of $\log_2 n$. In
fact, a correction factor φ is required and the cardinality estimation can
be done by the formula:

$$ n \approx \frac{1}{\varphi} 2^R, \qquad (3.3) $$

where φ ≈ 0.77351.


Flajolet and Martin have chosen to use the least significant 0-bit position
(the left-most position of 0) as the estimation of cardinality and built their
algorithm based on it. However, from the observation above we can see
that the most significant 1-bit position (the right-most position of 1) can
be used for the same purpose, although it has a flatter distribution that
leads to a bigger standard error.

The algorithm to compute the left-most position of zero in a simple
counter can be formulated as follows.

Algorithm 3.4: Computing the left-most zero position
Input: Simple counter of length M
Output: The left-most position of zero
for j ← 0 to M − 1 do
    if Counter[j] = 0 then
        return j
return M


Example 3.6: Cardinality estimation with simple counter

Consider the Counter from Example 3.5 and compute the estimated
number of distinct elements.

Position:  0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15
Bit:       1  1  1  1  0  0  0  0  0  0  0  0  0  0  0  0

Position: 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31
Bit:       0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0

Using Algorithm 3.4, in the Counter the left-most value 0 appears in
position R = 4, therefore, according to the formula (3.3), the cardinality
estimation is

$$ n \approx \frac{1}{0.77351} 2^4 \approx 20.68. $$

The exact cardinality of the set is 10, meaning the computed estimation
has a huge error due to the fact that the values of R are integers and for
very close ranks we can obtain results that differ in some binary orders
of magnitude. For instance, in our example, R = 3 would give an almost
perfect estimation of 10.34.
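
The computation of this example can be retraced with a short Python sketch. The hash values are copied from the table in Example 3.5; the helper names `rank` and `leftmost_zero` are illustrative assumptions that mirror definition (3.2) and Algorithm 3.4.

```python
PHI = 0.77351  # correction factor from formula (3.3)
M = 32         # counter length for 32-bit hash values


def rank(i, length=M):
    """Position of the least significant 1-bit (LSB 0 numbering)."""
    if i == 0:
        return length
    return (i & -i).bit_length() - 1


def leftmost_zero(counter):
    """Index of the first zero bit, or len(counter) if there is none."""
    for j, bit in enumerate(counter):
        if bit == 0:
            return j
    return len(counter)


# Hash values of the ten distinct cities, as listed in Example 3.5.
hashes = [4161497820, 3680793991, 3491299693, 629555247, 3450927422,
          2970154142, 2673248856, 50122705, 3271070806, 4039747979]

counter = [0] * M
for h in hashes:
    counter[rank(h)] = 1   # none of these hash values is zero

R = leftmost_zero(counter)
estimate = 2 ** R / PHI
print(R, round(estimate, 2))
```

Because R is an integer, the estimate can only take values of the form 2^R / φ, which is why a single counter jumps between roughly 10.34 and 20.68 here.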


Theoretically, the cardinality estimation based on a single simple counter
can provide very close expected values, but it has quite a high variance that
usually corresponds, as we also observed in Example 3.6, to the impractical
standard error δ of one binary order of magnitude.

Obviously, the weakness of the one-counter approach is that there is
a lack of highly confident estimations for the cardinality (in fact, it makes
its prediction based on a single estimation only).

Thereby, the natural extension of the algorithm is to have many simple
counters and, consequently, increase the number of estimations. The final
prediction n can be obtained by averaging the predictions $R_k$ from those
counters $\{\mathrm{Counter}_k\}_{k=0}^{m-1}$.

Thus, the modified formula (3.3) of the Probabilistic Counting
algorithm has the form:

$$ n \approx \frac{1}{\varphi} 2^{\bar{R}} = \frac{1}{\varphi} 2^{\frac{1}{m} \sum_{k=0}^{m-1} R_k}, \qquad (3.4) $$

and the cardinality n will have the same-quality estimated value, but
with a much smaller variance.

The obvious practical disadvantage to building m independent simple
counters is the requirement to compute values of m different hash
functions that, given that a single hash function can be computed in
O(1), has O(m) time complexity and quite high CPU costs.

The solution to optimizing the Probabilistic Counting algorithm is to
apply a special procedure, called stochastic averaging, in which the m hash
functions are replaced by only one whose value is split into a quotient and
a remainder, which are used to update a single counter per element.
The remainder r is used to choose one out of m counters and the quotient q
to calculate the rank and find the appropriate index to be updated in
that counter.


Algorithm 3.5: Using stochastic averaging to update counters
Input: Element x ∈ D
Input: Array of m simple counters with hash function h
r ← h(x) mod m
q ← h(x) div m := ⌊h(x) / m⌋
j ← rank(q)
if Counter_r[j] = 0 then
    Counter_r[j] ← 1

Applying the stochastic averaging Algorithm 3.5 to the Probabilistic
Counting, under the assumption that quotient-based distribution of
elements is fair enough, we may expect that $\frac{n}{m}$ elements have been indexed
by each simple counter $\{\mathrm{Counter}_k\}_{k=0}^{m-1}$, therefore the formula (3.4) is
a good estimation for $\frac{n}{m}$ (not n directly):

$$ n \approx \frac{m}{\varphi} 2^{\bar{R}} = \frac{m}{\varphi} 2^{\frac{1}{m} \sum_{k=0}^{m-1} R_k}. \qquad (3.5) $$



Algorithm 3.6: Flajolet-Martin algorithm (PCSA)
Input: Dataset D
Input: Array of m simple counters with hash function h
Output: Cardinality estimation
for x ∈ D do
    r ← h(x) mod m
    q ← h(x) div m
    j ← rank(q)
    if Counter_r[j] = 0 then
        Counter_r[j] ← 1
S ← 0
for r ← 0 to m − 1 do
    R ← LeftMostZero(Counter_r)
    S ← S + R
return ⌊(m / φ) · 2^(S / m)⌋

The corresponding Algorithm 3.6 is called the Probabilistic Counting
algorithm with stochastic averaging (PCSA) and is also known as
the Flajolet-Martin algorithm. In comparison to its version with m hash
functions, it reduces the time complexity for each element to about
O(1).

Example 3.7: Cardinality estimation with stochastic averaging

Consider the dataset and the hash values computed in Example 3.5 and
apply a stochastic averaging technique simulating m = 3 hash functions.
We use the remainder r to choose one out of three counters and the quotient
q to calculate the rank.


City         h(City)      r   q            rank(q)
Athens       4161497820   0   1387165940   2
Berlin       3680793991   1   1226931330   1
Kiev         3491299693   1   1163766564   2
Lisbon       629555247    0   209851749    0
London       3450927422   2   1150309140   2
Madrid       2970154142   2   990051380    2
Paris        2673248856   0   891082952    3
Rome         50122705     1   16707568     4
Vienna       3271070806   1   1090356935   0
Washington   4039747979   2   1346582659   0



Every counter handles information for about one-third of the cities,
therefore, the distribution is fair enough. After indexing all elements and
setting the appropriate bits in the corresponding counters, our counters
have the following forms.


Counter_0

Position:  0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15
Bit:       1  0  1  1  0  0  0  0  0  0  0  0  0  0  0  0

Position: 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31
Bit:       0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0

Counter_1

Position:  0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15
Bit:       1  1  1  0  1  0  0  0  0  0  0  0  0  0  0  0

Position: 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31
Bit:       0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0

Counter_2

Position:  0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15
Bit:       1  0  1  0  0  0  0  0  0  0  0  0  0  0  0  0

Position: 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31
Bit:       0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0


The left-most positions of zero for each counter are
$R_0 = 1$, $R_1 = 3$ and $R_2 = 1$. Thus, the estimation of the cardinality
according to formula (3.5) is

$$ n \approx \frac{3}{\varphi} 2^{\frac{1}{3} \sum_{k=0}^{2} R_k} = \frac{3}{0.77351} 2^{\frac{1 + 3 + 1}{3}} \approx 12.31. $$



The computed estimation is very close to the true cardinality value of 
10, and even without using too many counters, it notably outperforms 
the estimation from Example 3.6. 
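
The example can be reproduced with a compact Python sketch of PCSA. The function names are illustrative assumptions; the hash values are the ones listed in Example 3.5, and the final rounding down is left out so that the intermediate value 12.31 stays visible. Clamping the rank to the last counter position for q = 0 is a choice of this sketch to stay inside the array.

```python
PHI = 0.77351


def rank(i, length=32):
    """Position of the least significant 1-bit (LSB 0 numbering)."""
    if i == 0:
        return length
    return (i & -i).bit_length() - 1


def leftmost_zero(counter):
    for j, bit in enumerate(counter):
        if bit == 0:
            return j
    return len(counter)


def pcsa_estimate(hash_values, m, length=32):
    """Stochastic averaging over m simple counters, formula (3.5)."""
    counters = [[0] * length for _ in range(m)]
    for h in hash_values:
        r, q = h % m, h // m                  # remainder picks the counter
        j = min(rank(q, length), length - 1)  # clamp, since rank(0) == length
        counters[r][j] = 1
    s = sum(leftmost_zero(c) for c in counters)
    return m / PHI * 2 ** (s / m)


hashes = [4161497820, 3680793991, 3491299693, 629555247, 3450927422,
          2970154142, 2673248856, 50122705, 3271070806, 4039747979]
print(round(pcsa_estimate(hashes, m=3), 2))
```

Feeding duplicates of the same hash values changes nothing, since each of them sets a bit that is already one.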


Properties 


The Flajolet-Martin algorithm works well for datasets with large
cardinalities and produces good approximations when $\frac{n}{m} > 20$. However,
additional non-linearities can appear in the algorithm for small
cardinalities that usually require special corrections.

One possible correction to the algorithm was proposed by Bjorn
Scheuermann and Martin Mauve in 2007 [Sc07], which adjusted
the formula (3.5) by adding a term that corrects it for small
cardinalities and quickly converges to zero for large cardinalities:

$$ n \approx \frac{m}{\varphi} \left( 2^{\bar{R}} - 2^{-\kappa \cdot \bar{R}} \right), \qquad (3.6) $$

where κ ≈ 1.75.


The standard error δ of the Flajolet-Martin algorithm is inversely
related to the number of used counters and can be approximated as

$$ \delta \approx \frac{0.78}{\sqrt{m}}. \qquad (3.7) $$

The reference values of the standard error for the widely used number
of counters can be found in Table 3.2.

The length M of each counter Counter can be selected in a way that:

$$ M > \log_2 \left( \frac{n}{m} \right) + 4, \qquad (3.8) $$

thus, the practically used M = 32 is enough to count cardinalities well beyond
$10^9$ using 64 counters.

Table 3.2: Trade-off between accuracy and storage (M = 32)

m       Storage      δ
64      256 bytes    9.7%
256     1.024 KB     4.8%
1024    4.1 KB       2.4%



The simple counters that have been built for different datasets can be
easily merged together, which results in a Counter for the union of those
datasets. Such merging is trivial and can be done by applying a bitwise
OR operation.
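
A minimal sketch of such a merge, assuming both counters are stored as Python lists of bits of equal length and were built with the same hash function (the function name and the sample arrays are hypothetical):

```python
def merge_counters(first, second):
    """Bitwise OR of two simple counters built with the same hash function."""
    if len(first) != len(second):
        raise ValueError("counters must have the same length")
    return [a | b for a, b in zip(first, second)]


left = [1, 1, 0, 0, 1, 0, 0, 0]
right = [1, 0, 1, 0, 0, 0, 0, 0]
print(merge_counters(left, right))  # [1, 1, 1, 0, 1, 0, 0, 0]
```

The merged array is exactly what would have been obtained by indexing both datasets into one counter, because a bit is set as soon as any element from either dataset produced that rank.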

Like the Bloom filter, the Probabilistic Counting algorithms do not 
support deletions. But, following the approach used in the Counting 
Bloom filter, their inner bit arrays can be extended by counters and they 
will support probabilistically correct deletions. However, the increased 
storage requirements have to be taken into account. 


3.3 LogLog and HyperLogLog 

The most popular probabilistic algorithms to estimate cardinality used
in practice are the LogLog family of algorithms that includes the LogLog
algorithm, proposed by Marianne Durand and Philippe Flajolet in
2003 [Du03], and its successors HyperLogLog and HyperLogLog++.

The algorithms use an approach that is similar to the Probabilistic
Counting algorithm in the way that the estimation of the cardinality n is done
by observing the maximum number of leading zeros in the binary
representation of values. They all require an auxiliary memory and
perform a single pass over the data to produce an estimate of
the cardinality.

As usual, every element in the dataset is pre-processed by applying
a hash function h that transforms elements into integers sufficiently
uniformly distributed over a scalar range $\{0, 1, \ldots, 2^M - 1\}$ or, equivalently,
over the set of binary strings 3 of length M:

$$ h(x) = i = \sum_{k=0}^{M-1} i_k \cdot 2^k := (i_0 i_1 \ldots i_{M-1})_2, \quad i_k \in \{0, 1\}. $$



The steps of the algorithms are similar to PCSA, which we reproduce
here once again. First, the algorithm splits the initial dataset or input stream into
some number of subsets, each of which is indexed by one of m simple
counters. Then, according to the stochastic averaging, because there is
a single hash function, we choose the counter for the particular element x
using one part of its hash value h(x), while another part is used to update
the corresponding counter.

All algorithms discussed here are based on the observation of
the patterns $0^k 1$ that occur at the beginning of the values for
the particular counter, and associate each pattern with its index, called
rank. The rank is equivalent to the least significant 1-bit position in
the binary representation of the hash value of the indexed element and can
be calculated by the formula (3.2). Each simple counter builds its own
cardinality observation based on the seen ranks; the final estimation of
the cardinality is produced from such observations using an evaluation
function.

In regards to storage, the counters in the Probabilistic Counting
algorithm are relatively costly to maintain, but the LogLog algorithm
suggests a more storage-efficient solution together with a better
evaluation function and bias correction approach.

3 We use the “LSB 0” numbering scheme and start at zero for the least significant bit (LSB).

LogLog algorithm 

The basic idea of the LogLog algorithm starts with the computation of
ranks for each input element based on a single hash function h. Since we
can expect that $\frac{n}{2^{k+1}}$ elements can have rank(·) = k, where n is the total
number of elements indexed into a counter, the maximal observed rank
can provide a good indication of the value of $\log_2 n$:

$$ R = \max_{x \in D} \mathrm{rank}(h(x)) \approx \log_2 n. \qquad (3.9) $$


However, such estimation has an error of about ±1.87 binary orders of
magnitude, which is impractical. To reduce the error, the LogLog
algorithm uses a bucketing technique based on the stochastic averaging
and splits the dataset into $m = 2^p$ subsets $S_0, S_1, \ldots, S_{m-1}$, where
the precision parameter p defines the number of bits used in navigation.

Thus, for every element x from the dataset, the first p bits of the M-bit
hash value h(x) can be taken to find out the index j of the appropriate
subset:

$$ j = (i_0 i_1 \ldots i_{p-1})_2, $$

and the remaining (M − p) bits are indexed into the corresponding counter
Counter[j] to compute the rank and get the observation $R_j$ according
to formula (3.9).

Under fair distribution, every subset receives $\frac{n}{m}$ elements, therefore
observations $R_j$ from the counters $\{\mathrm{Counter}[j]\}_{j=0}^{m-1}$ can provide
an indication of the value of $\log_2 \frac{n}{m}$, and using their arithmetic mean
with some bias correction, we can reduce a single observation variance:

$$ n = \alpha_m \cdot m \cdot 2^{\frac{1}{m} \sum_{j=0}^{m-1} R_j}, \qquad (3.10) $$

where $\alpha_m = \left( \Gamma\left(-\frac{1}{m}\right) \cdot \frac{1 - 2^{1/m}}{\log 2} \right)^{-m}$ and Γ(·) is the gamma function. However,
for most practical cases m ≥ 64 it is enough to just use $\alpha_m \approx 0.39701$.

Algorithm 3.7: Estimating cardinality with LogLog
Input: Dataset D
Input: Array of m LogLog counters with hash function h
Output: Cardinality estimation
Counter[j] ← 0, j = 0 ... m − 1
for x ∈ D do
    i ← h(x) := (i_0 i_1 ... i_{M−1})_2, i_k ∈ {0, 1}
    j ← (i_0 i_1 ... i_{p−1})_2
    r ← rank((i_p i_{p+1} ... i_{M−1})_2)
    Counter[j] ← max(Counter[j], r)
R ← (1 / m) · Σ_{k=0}^{m−1} Counter[k]
return α_m · m · 2^R
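
A minimal Python sketch of Algorithm 3.7 is given below, under several loudly stated assumptions: SHA-1 from the standard library stands in for the 32-bit hash function h, the class and helper names are hypothetical, and the constant 0.39701 is used for all m. One further deliberate assumption is that ranks are counted from 1 (the position of the first 1-bit, as in the original LogLog paper), which is one more than definition (3.2) gives, so that a counter value of 0 can mean that no element has been seen.

```python
import hashlib

ALPHA = 0.39701  # asymptotic value of alpha_m


def hash32(element):
    """Hypothetical 32-bit hash built from SHA-1 (stand-in for h)."""
    digest = hashlib.sha1(str(element).encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big")


class LogLog:
    def __init__(self, p=10, M=32):
        self.p, self.M = p, M
        self.m = 1 << p                  # m = 2^p counters
        self.counters = [0] * self.m

    def add(self, element):
        i = hash32(element)
        j = i & (self.m - 1)             # bits i_0 ... i_{p-1} pick the counter
        rest = i >> self.p               # remaining M - p bits
        if rest == 0:
            r = self.M - self.p + 1
        else:
            r = (rest & -rest).bit_length()  # rank counted from 1
        self.counters[j] = max(self.counters[j], r)

    def estimate(self):
        mean_rank = sum(self.counters) / self.m
        return ALPHA * self.m * 2 ** mean_rank


sketch = LogLog(p=10)
for k in range(50000):
    sketch.add(f"item-{k}")
print(round(sketch.estimate()))  # expected to be in the vicinity of 50000
```

With m = 1024 counters the result typically deviates from the true cardinality by a few percent; the sketch is only meaningful when every counter has received many elements.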


Properties 

The standard error δ of the LogLog algorithm is inversely related to
the number of used counters m and can be closely approximated as

$$ \delta \approx \frac{1.3}{\sqrt{m}}. \qquad (3.11) $$

Hence, for m = 256 the standard error is about 8% and for m = 1024 it
decreases to about 4%.

The storage requirements of the LogLog algorithm can be estimated as
$O(\log_2 \log_2 n)$ bits of storage if counts till n are needed. More precisely,
the total space required by the algorithm in order to count to n is
$m \cdot \log_2 \log_2 \frac{n}{m} \, (1 + o(1))$.

In comparison to the Probabilistic Counting algorithm where each
counter requires 16 or 32 bits, the LogLog algorithm requires much
smaller counters $\{\mathrm{Counter}[j]\}_{j=0}^{m-1}$, usually of 5 bits each. However,
while the LogLog algorithm provides better storage-efficiency than
the Probabilistic Counting algorithm, it is slightly less accurate.

Assume that we need to count cardinalities till $2^{30}$, that is about 1 billion,
with an accuracy of about 4%. As already mentioned, for such a standard
error, m = 1024 buckets are required, each of which will receive roughly
$\frac{n}{m} = 2^{20}$ elements.

Since $\log_2(\log_2 2^{20}) \approx 4.32$, it is enough to allocate about 5 bits
per bucket (i.e., a value less than 32). Hence, to estimate cardinalities up
to about $10^9$ with the standard error of 4%, the algorithm requires 1024
buckets of 5 bits, which is 640 bytes in total.
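
The arithmetic of this sizing exercise can be checked with a few lines of Python; the function name is a hypothetical helper introduced only for this illustration.

```python
import math


def loglog_storage(n_max, m):
    """Bits per bucket, total bits and total bytes for counting up to n_max."""
    per_bucket = n_max / m                             # elements per bucket
    bits = math.ceil(math.log2(math.log2(per_bucket)))
    total_bits = bits * m
    return bits, total_bits, total_bits // 8


print(loglog_storage(2 ** 30, 1024))  # (5, 5120, 640)
```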


HyperLogLog algorithm 

An improvement of the LogLog algorithm, called HyperLogLog, was
proposed by Philippe Flajolet, Eric Fusy, Olivier Gandouet, and Frederic
Meunier in 2007 [Fl07]. The HyperLogLog algorithm uses a 32-bit hash
function, a different evaluation function, and various bias corrections.

Similar to the LogLog algorithm, HyperLogLog uses randomization to
approximate the cardinality of a dataset and has been designed to handle
cardinalities up to $10^9$ with a single 32-bit hash function h splitting
the dataset into $m = 2^p$ subsets, with precision p ∈ {4, ..., 16}.

Additionally, the evaluation function differentiates the HyperLogLog
algorithm from the standard LogLog. The original LogLog algorithm
uses the geometric mean while the HyperLogLog uses a function that is
based on a normalized version of the harmonic mean:

$$ \hat{n} \approx \alpha_m \cdot m^2 \cdot \left( \sum_{j=0}^{m-1} 2^{-\mathrm{Counter}[j]} \right)^{-1}, \qquad (3.12) $$

where

$$ \alpha_m = \left( m \int_0^{\infty} \left( \log_2 \left( \frac{2 + x}{1 + x} \right) \right)^m dx \right)^{-1}. $$

The approximate values of $\alpha_m$ can be found in Table 3.3.

The intuition behind using the harmonic mean is that it reduces
the variance due to its property to tame skewed probability distributions.

Table 3.3: α_m for most used values of m

m         α_m
2^4       0.673
2^5       0.697
2^6       0.709
≥ 2^7     0.7213 · m / (m + 1.079)



However, the estimation (3.12) requires a correction for small and large
ranges due to non-linear errors. Flajolet et al. empirically found that for
small cardinalities $n \leq \frac{5}{2} m$, to achieve better estimates, the HyperLogLog
algorithm can be corrected with Linear Counting using the number of
zero-valued Counter[j] counters (if a counter has a zero value, we can say
with certainty that the particular subset is empty).

Thus, for different ranges of cardinality, expressed as intervals on
the estimate $\hat{n}$ computed by formula (3.12), the algorithm provides
the following corrections:

$$ n = \begin{cases} \text{LinearCounter}, & \hat{n} \leq \frac{5}{2} m \text{ and } \exists j : \mathrm{Counter}[j] = 0, \\ -2^{32} \log\left(1 - \frac{\hat{n}}{2^{32}}\right), & \hat{n} > \frac{1}{30} 2^{32}, \\ \hat{n}, & \text{otherwise.} \end{cases} \qquad (3.13) $$

This correction matters: without it, for n = 0 the raw estimate (3.12)
always returns roughly 0.7m.

Since the HyperLogLog algorithm uses a 32-bit hash function, when
cardinality approaches $2^{32} \approx 4 \cdot 10^9$ the hash function almost reaches its
limit and the probability of collisions increases. For such large ranges,
the HyperLogLog algorithm estimates the number of different hash values
and uses it to approximate the cardinality. However, in practice, there is
a danger that a higher number just cannot be represented and will be
lost, impacting the accuracy.

Consider a hash function that maps the universe to values of M bits.
At most, such a function can encode $2^M$ different values and if the estimated
cardinality n approaches such a limit, the hash collisions become more and
more probable.

There is no evidence that some popular hash functions (e.g.,
MurmurHash3, MD5, SHA-1, SHA-256) perform significantly better
than others in HyperLogLog algorithms or its modifications.

The complete HyperLogLog algorithm is shown below. 


Algorithm 3.8: Estimating cardinality with HyperLogLog
Input: Dataset D
Input: Array of m LogLog counters with hash function h
Output: Cardinality estimation
Counter[j] ← 0, j = 0 ... m − 1
for x ∈ D do
    i ← h(x) := (i_0 i_1 ... i_31)_2, i_k ∈ {0, 1}
    j ← (i_0 i_1 ... i_{p−1})_2
    r ← rank((i_p i_{p+1} ... i_31)_2)
    Counter[j] ← max(Counter[j], r)
R ← Σ_{k=0}^{m−1} 2^(−Counter[k])
n̂ ← α_m · m^2 · (1 / R)
n ← n̂
if n̂ ≤ (5/2) · m then
    Z ← count(Counter[j] = 0), j = 0 ... m − 1
    if Z ≠ 0 then
        n ← m · log(m / Z)
else if n̂ > (1/30) · 2^32 then
    n ← −2^32 · log(1 − n̂ / 2^32)
return n

Properties

Similar to the LogLog algorithm, there is a clear trade-off between
the standard error δ and the number of counters m:

$$ \delta \approx \frac{1.04}{\sqrt{m}}. $$

The memory requirement does not grow linearly with the number of
elements (unlike, e.g., the Linear Counting algorithm). Allocating (M − p)
bits for the hash values and having $m = 2^p$ counters in total, the required
memory is

$$ \lceil \log_2 (M + 1 - p) \rceil \cdot 2^p \text{ bits}, \qquad (3.14) $$

moreover, since the algorithm uses only 32-bit hash functions and
the precision p ∈ {4, ..., 16}, the memory requirement for
the HyperLogLog data structure is $5 \cdot 2^p$ bits.

Therefore, the HyperLogLog algorithm makes it possible to estimate
cardinalities well beyond $10^9$ with a typical accuracy of 2% while using
a memory of only 1.5 KB.

For instance, the well-known in-memory database Redis maintains⁴
HyperLogLog data structures of 12 KB that approximate cardinalities
with a standard error of 0.81%.
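
The trade-off between accuracy and memory can be checked with a few lines of Python. The sketch below evaluates the standard error $1.04 / \sqrt{m}$ and the memory formula (3.14); the function names are hypothetical, and the parameters M = 64 and p = 14 used for the Redis figures are an assumption, since the text only states the resulting 12 KB and 0.81%.

```python
import math


def standard_error(p):
    """Standard error 1.04 / sqrt(m) for m = 2^p counters."""
    m = 1 << p
    return 1.04 / math.sqrt(m)


def memory_bits(hash_bits, p):
    """Memory by formula (3.14): ceil(log2(M + 1 - p)) * 2^p bits."""
    bits_per_counter = math.ceil(math.log2(hash_bits + 1 - p))
    return bits_per_counter * (1 << p)


# 32-bit hash, p = 11: 5 bits per counter
print(memory_bits(32, 11) / 8, "bytes, error", standard_error(11))

# Assumed Redis-like setup: 64-bit hash, p = 14, 6 bits per counter
print(memory_bits(64, 14) / 8, "bytes, error", standard_error(14))
```

Under these assumptions the second line reports 12288 bytes (12 KB) and an error of about 0.0081, which agrees with the figures quoted above.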

While HyperLogLog, in comparison to LogLog, improved
the cardinality estimation for small datasets, it still overestimates
the real cardinalities in such cases.

Variants of the HyperLogLog algorithm are implemented in well-known
databases such as Amazon Redshift, Redis, Apache CouchDB,
Riak, and others.


4 PFCOUNT in Redis https://redis.io/commands/pfcount
HyperLogLog++ algorithm

In 2013 [He13], Stefan Heule, Marc Nunkesser, and Alexander Hall
published an improved version of HyperLogLog, the HyperLogLog++
algorithm, which focuses on large cardinalities and better bias
correction.

The most noticeable improvement of the HyperLogLog++ algorithm
is the usage of a 64-bit hash function. Clearly, the longer the output
values of the hash function, the more different elements can be encoded.
This improvement makes it possible to estimate cardinalities far larger than $10^9$
unique elements, but when the cardinality approaches $2^{64} \approx 1.8 \cdot 10^{19}$,
hash collisions become a problem for HyperLogLog++ as well.

The HyperLogLog++ algorithm uses exactly the same evaluation 
function given by (3.12). However, it improves the bias correction. 
The authors of the algorithm performed a series of experiments to measure 
the bias and found that for n < 5m the bias of the original HyperLogLog 
algorithm could be further corrected using empirical data collected over 
the experiments. 

In addition to the original article, Heule et al. provided⁵ empirically
determined values to improve the bias correction in the algorithm:
arrays of raw cardinality estimates rawEstimateData and related
biases biasData. Of course, it is not feasible to cover every possible
case, so rawEstimateData provides an array of 200 interpolation
points, storing the average raw estimate measured at each point over 5000
different datasets. biasData contains about 200 measured biases that
correspond to the rawEstimateData. Both arrays are zero-indexed
and contain precomputed values for all supported precisions $p \in 4 \ldots 18$,
where the zero index in the arrays corresponds to the precision value 4.
As an example, for $m = 2^{10}$ and $p = 10$ the needed data can be found in
rawEstimateData[6] and biasData[6].

The bias correction procedure in the HyperLogLog++ algorithm can 
be formalized as follows. 


5 Appendix to HyperLogLog in Practice http://goo.gl/iU8Ig
Algorithm 3.9: Correcting bias in HyperLogLog++
Input: Estimate n̂ with precision p
Output: Bias-corrected cardinality estimate
n_low ← 0, n_up ← 0, j_low ← 0, j_up ← 0
for j ← 0 to length(rawEstimateData[p − 4]) do
    if rawEstimateData[p − 4][j] > n̂ then
        j_low ← j − 1, j_up ← j
        n_low ← rawEstimateData[p − 4][j_low]
        n_up ← rawEstimateData[p − 4][j_up]
        break
b_low ← biasData[p − 4][j_low]
b_up ← biasData[p − 4][j_up]
y = interpolate((n_low, n_low − b_low), (n_up, n_up − b_up))
return y(n̂)


Example 3.8: Bias correction using empirical values
As an example, assume that we have computed the cardinality estimation
$\hat{n} = 2018.34$ using the formula (3.12) and want to correct it for the precision
$p = 10$ ($m = 2^{10}$).

First, we check the rawEstimateData[6] array and determine that such
a value $\hat{n}$ falls in the interval between values with indices 73 and 74 of
that array, where rawEstimateData[6][73] = 2003.1804 and
rawEstimateData[6][74] = 2026.071:

$$2003.1804 < \hat{n} < 2026.071.$$

Thus, we need to retrieve biases from biasData[6] that are indexed at
the same positions 73 and 74, which are biasData[6][73] = 134.1804 and
biasData[6][74] = 131.071.

The correct estimation is in the interval:

$$[2003.1804 - 134.1804, \; 2026.071 - 131.071] = [1869.0, \; 1895.0]$$

and to compute the corrected approximation, we can interpolate these
values, e.g., using k-nearest neighbor search or just by a linear interpolation
$y(x) = a \cdot x + b$, where $y(2003.1804) = 1869.0$ and $y(2026.071) = 1895.0$.
Thus, using simple calculations, the interpolation line is

$$y = 1.135837 \cdot x - 406.28725,$$

and the interpolated value for our cardinality estimation is

$$n = y(\hat{n}) = y(2018.34) \approx 1886.218.$$
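
The linear interpolation from Example 3.8 can be reproduced with a short Python sketch. The function correct_bias is a hypothetical helper, and only the two interpolation points quoted in the example are used here instead of the full empirical arrays; the behavior outside the covered interval is a simplifying assumption.

```python
def correct_bias(n_hat, raw_estimates, biases):
    """Bias-correct n_hat by linear interpolation between neighbouring points.

    raw_estimates and biases are parallel lists for one precision value,
    sorted by raw estimate (an excerpt of the empirical arrays).
    """
    for j in range(1, len(raw_estimates)):
        if raw_estimates[j] >= n_hat:
            n_low, n_up = raw_estimates[j - 1], raw_estimates[j]
            y_low = n_low - biases[j - 1]      # corrected value at lower point
            y_up = n_up - biases[j]            # corrected value at upper point
            slope = (y_up - y_low) / (n_up - n_low)
            return y_low + slope * (n_hat - n_low)
    # Assumption: beyond the last point, subtract the last known bias.
    return n_hat - biases[-1]


# The two points from Example 3.8 (indices 73 and 74 for p = 10)
raw = [2003.1804, 2026.071]
bias = [134.1804, 131.071]
print(round(correct_bias(2018.34, raw, bias), 2))  # close to 1886.22
```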

According to experiments performed by the authors of
HyperLogLog++, the estimate $n_{lin}$ built according to the Linear
Counting algorithm is still better for small cardinalities, even compared
to the bias-corrected value n. Therefore, if at least one empty counter
exists, the algorithm additionally computes the linear estimate and uses
a list of empirical thresholds, which can be found in Table 3.4, to choose
which evaluation should be preferred. In such a case, the bias-corrected
value n is used only when the linear estimate $n_{lin}$ falls above
the threshold $\chi_m$ for the current m.


Example 3.9: Bias correction with the threshold

Consider Example 3.8, where for $m = 2^{10}$ we computed the bias-corrected
value $n \approx 1886.218$. In order to determine whether or not we should
prefer this value to the estimation by Linear Counting, we need to find
out the number of empty counters Z in the HyperLogLog++ data structure.
Because we do not have that value in our example, assume it is Z = 73.

Thus, the linear estimation according to the formula (3.1) is

$$n_{lin} = m \cdot \log\frac{m}{Z} = 2^{10} \cdot \log\frac{2^{10}}{73} \approx 2704.$$

Next, we compare $n_{lin}$ to the threshold $\chi_m = 900$ from Table 3.4, which
is far below the computed value, therefore, we prefer the bias-corrected
estimate n to the Linear Counting estimate $n_{lin}$.
Table 3.4: Empirical thresholds χ_m for the supported precision values

 p     m       χ_m
 4     2^4     10
 5     2^5     20
 6     2^6     40
 7     2^7     80
 8     2^8     220
 9     2^9     400
 10    2^10    900
 11    2^11    1800
 12    2^12    3100
 13    2^13    6500
 14    2^14    11500
 15    2^15    20000
 16    2^16    50000
 17    2^17    120000
 18    2^18    350000
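
The choice between the Linear Counting estimate and the bias-corrected value can be sketched as a small Python function that uses the thresholds from Table 3.4. The names THRESHOLDS and choose_estimate are hypothetical, and the comparison with the threshold is assumed to be non-strict.

```python
import math

# Empirical thresholds from Table 3.4, indexed by the precision p
THRESHOLDS = {
    4: 10, 5: 20, 6: 40, 7: 80, 8: 220,
    9: 400, 10: 900, 11: 1800, 12: 3100, 13: 6500,
    14: 11500, 15: 20000, 16: 50000, 17: 120000, 18: 350000,
}


def choose_estimate(p, zero_counters, n):
    """Return the linear estimate if it is below the threshold, else n.

    n is the current estimate (bias-corrected for small cardinalities),
    zero_counters is the number Z of empty counters.
    """
    m = 1 << p
    if zero_counters != 0:
        n_lin = m * math.log(m / zero_counters)
        if n_lin <= THRESHOLDS[p]:
            return n_lin
    return n


# Values from Example 3.9: the linear estimate (about 2704) exceeds
# the threshold 900, so the bias-corrected value is kept.
print(choose_estimate(10, 73, 1886.218))
```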


The complete HyperLogLog++ algorithm is shown below. 


Algorithm 3.10: Estimating cardinality with HyperLogLog++
Input: Dataset D
Input: Array of m LogLog counters with hash function h
Output: Cardinality estimation
Counter[j] ← 0, j = 0 ... m − 1
for x ∈ D do
    i ← h(x) := (i_0 i_1 ... i_63)_2, i_k ∈ {0, 1}
    j ← (i_0 i_1 ... i_{p−1})_2
    r ← rank((i_p i_{p+1} ... i_63)_2)
    Counter[j] ← max(Counter[j], r)
R ← Σ_{k=0}^{m−1} 2^{−Counter[k]}
n̂ ← α_m · m^2 · (1/R)
n ← n̂
if n̂ < 5m then
    n ← CorrectBias(n̂)
Z ← count_{j=0...m−1}(Counter[j] = 0)
if Z ≠ 0 then
    n_lin ← m · log(m/Z)
    if n_lin ≤ χ_m then
        n ← n_lin
return n
Properties

The accuracy of HyperLogLog++ is better than HyperLogLog for a large 
range of cardinalities and equally good for the rest. For cardinalities 
between 12000 and 61000, the bias correction allows for a lower error 
and avoids a spike in the error when switching between sub-algorithms. 

Although HyperLogLog++ uses a 64-bit hash function, it doesn’t need to
store the hash values themselves, just one plus the maximal number of
leading zeros. Therefore, the memory requirements don’t grow significantly
compared to HyperLogLog and, according to (3.14), it requires only
$6 \cdot 2^p$ bits.

The HyperLogLog++ algorithm can be used to estimate cardinalities of
about $7.9 \cdot 10^9$ elements with a typical error rate of 1.625%, using 2.56 KB
of memory⁶.

As mentioned earlier, the algorithm uses the stochastic averaging
approach and splits the dataset into $m = 2^p$ subsets $\{D_j\}_{j=0}^{m-1}$, each of
which has an associated counter from $\{\text{Counter}[j]\}_{j=0}^{m-1}$; every counter handles
information about $\frac{n}{m}$ elements. Heule et al. noticed that for $n \ll m$
most counters are never used and don’t need to be stored, therefore
the storage can benefit from a sparse representation. If the cardinality
n is much smaller than m, then HyperLogLog++ requires significantly
less memory than its predecessors.

6 Micha Gorelick and Ian Ozsvald, High Performance Python, 2014

The HyperLogLog++ algorithm in a sparse version stores only pairs
(j, Counter[j]), representing them as a single integer by concatenating
their bit patterns. All such pairs are stored in a single sorted list of
integers. Since we always compute the maximal rank, we don’t need
to store different pairs with the same index, instead only the pair with
the maximal rank has to be stored. In practice, to provide a better
experience one can maintain another unsorted list for fast insertions that
has to be periodically sorted and merged into the primary list. If such
a list requires more memory than the dense representation of the counters,
it can be easily converted to the dense form. Additionally, to make
the sparse representation even more space friendly, the HyperLogLog++
algorithm proposes different compression techniques using variable length
encoding and difference encoding for the integers, therefore storing only
the first pair and differences from its value.
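
The sparse representation described above can be illustrated with a short Python sketch. It is a simplified illustration and not the encoding used by any particular implementation: the width RANK_BITS and the helpers encode, decode, and merge are assumptions, and the compression step (variable length and difference encoding) is omitted.

```python
RANK_BITS = 6  # assumption: 6 bits are enough to store a rank up to 63


def encode(index, rank):
    """Concatenate the bit patterns of the counter index and its rank."""
    return (index << RANK_BITS) | rank


def decode(value):
    """Split an encoded integer back into (index, rank)."""
    return value >> RANK_BITS, value & ((1 << RANK_BITS) - 1)


def merge(sorted_list, unsorted_buffer):
    """Merge the insertion buffer into the sorted primary list.

    For equal indices only the pair with the maximal rank is kept.
    Because the index occupies the high bits, pairs with the same index
    are adjacent after sorting and the last one has the largest rank.
    """
    merged = []
    for value in sorted(sorted_list + unsorted_buffer):
        index, _ = decode(value)
        if merged and decode(merged[-1])[0] == index:
            merged[-1] = value
        else:
            merged.append(value)
    return merged


primary = merge([], [encode(3, 2), encode(1, 5)])
primary = merge(primary, [encode(3, 4), encode(1, 1)])
print([decode(v) for v in primary])  # [(1, 5), (3, 4)]
```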

Currently, the HyperLogLog++ algorithm is widely used in many 
popular applications, including Google BigQuery and Elasticsearch. 


Conclusion 

In this chapter we covered various probabilistic approaches to counting
unique elements in huge datasets. We have discussed the difficulties
that appear in cardinality estimation tasks and learned a simple solution
that could approximate the small cardinalities quite well. Further, we
have studied the family of algorithms based on an observation of certain
patterns in the hashed representations of elements from the dataset, which
was followed by many improvements and modifications that have become
the industry standard today for estimating cardinalities of almost any range.

If you are interested in more information about the material covered 
here or want to read the original papers, please take a look at the list of 
references that follows this chapter. 

In the next chapter we consider streaming applications and study 
the efficient probabilistic algorithms to estimate frequencies of elements, 
find heavy hitters and trending elements in data streams. 
4

Frequency 


Many important problems with streaming applications that operate on large
data streams are related to the estimation of the frequencies of elements,
including determining the most frequent element or detecting the trending
ones over some period of time.

As seen in other problems, when data streams are large enough (they
can be seen as an infinite sequence of elements) and have a big number
of distinct elements, the usual solutions, like sorting or keeping counters
for every element, are not possible anymore. It is also important to note
that in most cases it isn’t feasible to store and re-process such sequences,
therefore one-pass data stream algorithms are required.

If the data stream is large but has a low cardinality (contains only a small
number of distinct elements), it is enough to maintain exact frequency
counters using one counter per distinct element; in these cases, there is no
need for special algorithms.

The specificity of Big Data applications that handle large data
streams requires that appropriate data structures and algorithms fulfill
the following requirements:

• Make one pass through the data. 

• Have sublinear space (polylogarithmic at most), meaning they don’t 
grow as fast as the input stream does. 

• Support fast and simple updates with some guarantee of accuracy. 

Because of the space restrictions, it is clear that such structures need
to operate on data in a compressed form that is some summary of the data
stream (e.g., a sketch), which makes it impossible to compute most functions
over the stream precisely, therefore probabilistic approximation is needed.

Let’s start with formal definitions. By data stream $D = \{x_1, x_2, \ldots, x_n\}$
we mean a sequence of elements of any nature, assuming that the number
of elements n is very large, e.g., billions, and there is an unknown large
number of distinct elements. If the stream is truly infinite, D can be
seen as a substream if viewed in a time window.

With an approach to estimating frequencies of elements in a huge
data stream, we can address the common problem of finding the list of
high-frequency elements in a stream, known as the Frequent problem.

When we look for an element that occurs more than $\frac{n}{2}$ times in data
stream D, we consider the Majority problem that was formulated as
a research problem by J. Strother Moore in the Journal of Algorithms in
1981 [Bo81]. We can postulate that such an element exists in the stream,
which is not always true, and, by definition, it is clear that it can be
only one element for the given data stream, which is called the majority
element.

Obviously, this is mostly a toy problem, but it gives us a better
understanding of the frequency problems involved with data streams.

One of the most complex problems that appears in practice while
working with big data is the problem of finding the top k most frequent
elements in the stream that occur more than $\frac{n}{k}$ times, also known as
heavy hitters, where $k \ll n$; typical values of k are 10, 100, or 1000.

Searching for heavy hitters presupposes that some elements can occur 
significantly more often than others in the data stream, otherwise there is 
no sense in solving this problem. 

This might be surprising, but it was proven that there is no algorithm 
that can solve the Heavy hitters problem in one pass using a sublinear 
space. 

Many practical applications are connected to the Heavy hitters problem,
including search, log mining, network analysis, traffic engineering, and
anomaly detection. For instance, we might want to determine the heaviest
k users (for a desired value of $k \ll n$) for a high-traffic website. However,
some users may have nearly equal load and getting an exact answer to
this question is impossible using limited space.

In practice, we consider the ε-Heavy hitters problem,
an ε-approximation of the Heavy hitters problem, whose output contains
all elements that occur at least $\frac{n}{k}$ times, with a guarantee that every
reported element occurs at least $\frac{n}{k} - \varepsilon \cdot n$ times, where $\varepsilon > 0$ and is
small. For instance, for $\varepsilon = \frac{1}{2k} > 0$
the output of the ε-Heavy hitters problem will be elements with
frequencies at least $\frac{n}{k}$ and a guarantee that they occur at least

$$\frac{n}{k} - \varepsilon \cdot n = \frac{n}{k} - \frac{n}{2k} = \frac{n}{2k} \text{ times}.$$

For small data streams (regardless of the number of unique elements) it
is enough to just sort the elements and, using a linear scan, find elements
that occur at least $\frac{n}{k}$ times; these will be the heavy hitters.
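
For a small stream that fits in memory, the sort-and-scan approach can be sketched in Python as below. The function name heavy_hitters_exact and the sample stream are illustrative assumptions; the approach is exact but needs the whole stream at once, so it is not a streaming algorithm.

```python
def heavy_hitters_exact(stream, k):
    """Return elements occurring at least n/k times, via sort + linear scan."""
    threshold = len(stream) / k
    ordered = sorted(stream)
    result = []
    run_start = 0  # start of the current run of equal elements
    for pos in range(1, len(ordered) + 1):
        # a run ends at the end of the list or when the value changes
        if pos == len(ordered) or ordered[pos] != ordered[run_start]:
            if pos - run_start >= threshold:
                result.append(ordered[run_start])
            run_start = pos
    return result


stream = [4, 4, 4, 4, 6, 2, 3, 5, 4, 4, 3, 3, 4, 2, 3, 3, 3, 2]
print(heavy_hitters_exact(stream, 3))  # [3, 4]
```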


In an arbitrary data stream D, there are from zero to k heavy hitters
and, in contrast to the Majority problem, it is much more likely that for
some k at least one heavy hitter exists, while the majority element doesn’t.
Therefore, the Majority problem could be seen as a particular case of
the Heavy hitters problem with the requirement that such a majority
element exists and $k \approx 2 - \varepsilon$, where $\varepsilon > 0$ and is small.


Example 4.1: DNS DDoS attack detection (Afek et al., 2016) 

A distributed denial-of-service (DDoS) attack includes many systems
flooding the resources of the targeted system, typically by sending a large
number of queries from a botnet. One popular target is the Domain
Name System (DNS) that plays the role of a “phonebook” of the Internet,
providing the translation between easy to remember domain names and
IP addresses of websites.

DNS queries are considered a data stream where each element has 
an associated domain to resolve. Going further we can group the queries 
using their top-level domain and by investigating the heaviest domains in 
the query stream we can detect the randomized DNS Flood when queries 
for many different non-existent subdomains of the same primary domain 
are issued. 


Another interesting task in streaming applications, called
the Max-Change problem, is to determine elements whose frequencies
changed the most across different data streams or time windows. This
problem has a practical importance for search engines since the queries
whose frequency changes most between two consecutive time periods can
indicate which topics are increasing or decreasing in popularity at
the fastest rate.


Example 4.2: Trending Twitter hashtags 

A hashtag is used to index a topic on Twitter and allows people to easily
follow items they are interested in. Hashtags are usually written with a #
symbol in front.

Every second about 6000 tweets¹ are created on Twitter, that is roughly
500 million items daily. Most of these tweets are linked with one or more
hashtags and to keep abreast of all the latest events it is important to
determine the most popular topics of the day.

This can be done by processing the data stream of tweets, estimating
the frequencies of each hashtag, and finding the most frequent values.
Additionally, it might be useful to compare the frequencies of yesterday’s
and today’s values to determine the topics that are trending, e.g., those
that have the most increased frequencies since yesterday.


Here we study various approaches to solve frequency-related problems 
in Big Data streams. We start with very simple deterministic algorithms 
and afterward learn modern probabilistic alternatives that can efficiently 
address real-world problems. 


4.1 Majority algorithm 

Without any additional investigation, it is possible to suggest a linear-time 
solution for the Majority problem because the majority element (of course, 
with the assumption that it exists) is the median. The disadvantage is 
that it requires multiple passes through the stream and, therefore, is not 
suitable for Big Data streams. 

The Majority algorithm, also known as the Boyer-Moore Majority 
Vote algorithm, was invented by Bob Boyer and J. Strother Moore in 
1981 [Bo81] to solve the Majority problem in a single pass through 
the data stream. A similar solution was independently proposed by 
Michael J. Fischer and Steven L. Salzberg in 1982 [Fi82]. 

The data structure for the Majority algorithm is fairly simple and is 
a pair made up of an integer counter and the so-called monitored element: 
S = (c,x*). Therefore, it requires a constant amount of memory, but 
its size varies and depends on the size of the elements. 

1 Twitter Usage Statistics https://www.internetlivestats.com/twitter-statistics/

Such a data structure supports only one simple update operation,
which updates the counter and selects the candidate for the monitored
element based on its previous state and current element x.


Algorithm 4.1: Updating the Majority data structure
Input: Element x ∈ D
if c = 0 then
    x* ← x
if x = x* then
    c ← c + 1
else
    c ← c − 1


With this type of data structure, it is simple to describe the algorithm.
For every element x in the stream D the algorithm triggers the update
procedure given by Algorithm 4.1 and, under the requirement that
the majority element exists, it returns the last monitored element as
the majority element. Note that the value of the counter is not
the frequency of the majority element.

Algorithm 4.2: Majority algorithm
Input: Data stream D
Output: Majority element
c ← 0
x* ← NULL
for x ∈ D do
    Update(x)
return x*

In the Majority algorithm, every “non-majority” value that follows can
decrease the counter c or even reset it to 0, which forces the re-election of
the monitored element x*. From a non-precise view, it might be unclear
how such an algorithm ends up with the correct value and whether there
is a danger that for some cases all majority values could be eliminated.
Each following “non-majority” value can wipe out only one copy of
the previous majority element, but as there have to be more than $\frac{n}{2}$
majority values in the data stream, there will not be enough “non-majority”
values and at least one copy of the majority value will be left
at the end. This also explains why the returned value of the counter
cannot be used as an approximation of the majority element’s frequency.


When the majority element doesn’t exist, the output of the Majority
algorithm is an arbitrary element of the data stream. Therefore, applying
such an algorithm when we are uncertain about the existence of the majority
element would require another pass through the data stream with a simple
counter to verify that the element given by Algorithm 4.2 is actually
the majority element that occurs more than $\frac{n}{2}$ times.


Example 4.3: Majority algorithm 

Consider a dataset of n = 10 elements: {4, 4, 3, 5, 6, 4, 4, 4, 4, 2}, where
the obvious majority element is x = 4 as it occurs 6 times out of 10.

According to the algorithm, we allocate a pair S = (c, x*) = (0, NULL)
and start consuming elements from the dataset. The first element is $x_1 = 4$
and, since our counter c is empty, we store it as the monitored element
x* = 4 and increase the counter: c = 1. The next element $x_2$ is 4 again,
which is equal to the monitored element, so we just increase the counter:
c = 2. The third input element is $x_3 = 3$ that is different from x* = 4,
thus we decrement our counter: c = 1. Similarly, after processing $x_4 = 5$,
we decrement the counter again and it becomes zero: c = 0.

Next, we process element $x_5 = 6$ and, since the current counter value is
zero, we update the monitored element x* = 6 and set its counter: c = 1.
However, it doesn’t stay too long and after handling elements $x_6 = 4$ and
$x_7 = 4$, the monitored element becomes x* = 4 again with counter c = 1.
The next two elements, $x_8 = 4$ and $x_9 = 4$, increment the counter to c = 3,
but the last element $x_{10} = 2$ is not equal to 4 and the counter is
decremented again to the value c = 2.

In the end, the correct majority element 4 is left as the monitored element.
However, note that the remaining counter c is not a frequency estimator
and contains a completely different value.
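
Algorithms 4.1 and 4.2 translate almost directly into Python. The sketch below is a minimal illustration with hypothetical function names; it also includes the second verification pass mentioned earlier, which assumes that the stream can be read again (here it is simply a list).

```python
def majority_candidate(stream):
    """One pass of the Majority algorithm: returns (monitored element, counter)."""
    counter, monitored = 0, None
    for x in stream:
        if counter == 0:
            monitored = x          # re-elect the monitored element
        if x == monitored:
            counter += 1
        else:
            counter -= 1
    return monitored, counter


def is_majority(stream, candidate):
    """Second pass: verify that the candidate occurs more than n/2 times."""
    occurrences = sum(1 for x in stream if x == candidate)
    return occurrences > len(stream) / 2


data = [4, 4, 3, 5, 6, 4, 4, 4, 4, 2]
candidate, _ = majority_candidate(data)
print(candidate, is_majority(data, candidate))  # 4 True
```

For a stream without a majority element, such as [1, 2, 3], the first pass still returns some element, and only the verification pass reveals that it is not a majority.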







The Majority algorithm is popular in undergraduate classes due to its
simplicity. In the next section we study its extension that can already
solve the Frequent and Heavy hitters problems.


4.2 Frequent algorithm 

A generalization of the Majority algorithm, known as the Frequent
algorithm, was proposed by Erik D. Demaine, Alejandro López-Ortiz,
and J. Ian Munro in 2002 [De02], years after the original algorithm. At
some point, it was discovered that the algorithm was actually the same
as the algorithm published by Jayadev Misra and David Gries in
1982 [Mi82], known now as the Misra-Gries algorithm.

The Frequent algorithm is designed to address the Heavy hitters
problem and instead of keeping only one counter like in the Majority
algorithm, the Frequent data structure consists of a set of monitored
elements X* and an array of p counters $C = \{c_i\}_{i=1}^{p}$.

Whenever we process a new element from the data stream, we first
check if it is already monitored. If the element is new, we add
it to X* only when we have room in the set since we maintain at
most p elements. If the element wasn’t added, we still want to reflect its
presence in the stream by decrementing the counters of all elements in
the set of monitored elements. When the element already exists in X*,
we just increment its associated counter. At the end of the procedure,
we go through the list and remove all elements whose counters hit zero.

In the original article, Misra and Gries used balanced search trees 
to represent the Frequent data structure, however, future researchers 
preferred to use hash tables and implement it as a dictionary. 
Algorithm 4.3: Updating the Frequent data structure
Input: Element x ∈ D
Input: Frequent data structure with p counters
if x ∉ X* then
    if ∃m : c_m = 0 then
        x*_m ← x
if x ∈ X* then
    ∃m : x*_m = x
    c_m ← c_m + 1
else
    for j ← 1 to p do
        if c_j > 0 then
            c_j ← c_j − 1
for j ← 1 to p do
    if c_j = 0 then
        X* ← X* \ {x*_j}


The Frequent algorithm uses the Frequent data structure of length p
to discover elements that occur at least $\frac{n}{p+1}$ times in the data stream of
length n. Thus, to determine up to k − 1 heavy hitter elements that occur
at least $\frac{n}{k}$ times in the data stream we need to use p = k − 1 counters.

Algorithm 4.4: Frequent algorithm
Input: Data stream D
Input: Frequent data structure with k − 1 counters
Output: Heavy hitters elements
C := {c_i}_{i=1}^{k−1}, c_i ← 0
X* ← ∅
for x ∈ D do
    Update(x)
return X*


The intuition behind the Frequent algorithm is very similar to
the Majority algorithm given the requirement that heavy hitter elements
occur more than $\frac{n}{k}$ times.

Example 4.4: Find heavy hitters with the Frequent algorithm

Consider a data stream of n = 18 elements:

{4, 4, 4, 4, 6, 2, 3, 5, 4, 4, 3, 3, 4, 2, 3, 3, 3, 2}.

To identify heavy hitter elements that occur in the data stream at least
$\frac{n}{3} = 6$ times, we allocate a Frequent data structure of p = 2 counters
and use Algorithm 4.4 to identify at most two of three possible heavy
hitters.

        1    2
X*      -    -
C       0    0

We start with element 4. Since it isn’t in X* and there are no elements
in the data structure, we freely add element 4 to the set of monitored
elements and increment the associated counter $c_1 = 1$.

        1    2
X*      4    -
C       1    0

Similarly, we process the next three elements that are also equal to 4.
Because this element is already in X*, we simply increment its
counter $c_1$.

        1    2
X*      4    -
C       4    0

The next element is 6 which is not monitored yet. Since the set X* has
space, we insert element 6 into it and set the counter $c_2 = 1$.

        1    2
X*      4    6
C       4    1

Next, we take element 2 which is also not in X*, however we have no
room in the set and cannot add it. Instead, we decrement counters of
all elements currently in X*. According to Algorithm 4.4 we also need
to remove from the monitored set elements whose counters hit zero. In
our example, this is element 6 which is therefore removed from the set.

        1    2
X*      4    -
C       3    0

Next, we consume element 3 from the data stream. This element is not
in X* and since the set has enough space, we just add it and set
the associated counter $c_2 = 1$.

        1    2
X*      4    3
C       3    1

Continuing in a similar way, we process all remaining elements and the
final data structure becomes:

        1    2
X*      4    3
C       3    3

Thus, the identified heavy hitters are elements 4 and 3. However,
the counters do not reflect the actual frequencies of the elements in
the data stream, as we also noted for the Majority algorithm.
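
The update procedure of the Frequent algorithm can be sketched in Python with a dictionary that maps monitored elements to their counters, in line with the hash table representation mentioned earlier. The function name frequent is hypothetical, and the dictionary replaces the separate set X* and array C of the pseudocode.

```python
def frequent(stream, p):
    """Frequent (Misra-Gries) algorithm with at most p monitored elements.

    Returns a dict {element: counter}; the counters are not frequencies.
    """
    counters = {}
    for x in stream:
        if x in counters:
            counters[x] += 1              # already monitored
        elif len(counters) < p:
            counters[x] = 1               # there is room: start monitoring
        else:
            # no room: decrement all counters and drop those that hit zero
            for key in list(counters):
                counters[key] -= 1
                if counters[key] == 0:
                    del counters[key]
    return counters


stream = [4, 4, 4, 4, 6, 2, 3, 5, 4, 4, 3, 3, 4, 2, 3, 3, 3, 2]
print(frequent(stream, p=2))  # {4: 3, 3: 3}
```

Run on the stream from Example 4.4, the sketch ends with elements 4 and 3 monitored, both with counter 3, although their real frequencies are 7 and 6.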


Properties 

The time cost of the algorithm is dominated by the O(1) dictionary
operations per update and the cost of decrementing counts. To optimize
the speed of the algorithm, all counters can be decremented at once, in
constant time, by organizing them in a sorted order and using difference
encoding, where the only information stored is about how much larger
the particular counter is compared to the next smallest one. Minimizing
significant movements in the order while incrementing and decrementing
the counter means all equal counters can be grouped. With such
an optimized data structure each counter no longer needs to store
a value, but rather its group. Thus, the Frequent algorithm can be
augmented to run in O(1) time.

Actually, even without any probabilistic approach, the algorithm
delivers at most k − 1 candidates in the Heavy hitters problem. However,
it is focused on determining the high-frequency elements without
approximating their correct frequencies. Therefore, if we want to
estimate frequencies of the elements, a second pass through the data
stream is required, which is unfeasible in most cases of handling huge
data streams.

Thus, in the next sections we continue studying solutions for
frequency-related problems with very effective probabilistic data
structures that are perfectly suited to Big Data streams.


4.3 Count Sketch 

A space-efficient algorithm that is used to solve many frequency-related 
data stream problems is the Count Sketch that was proposed by Moses 
Charikar, Kevin Chen, and Martin Farach-Colton in 2002 [Ch02]. They 
had a practical requirement to create a space-efficient data structure that 
could easily maintain approximate counts of high-frequency elements in 
a data stream. 

In order to better understand the problem that Count Sketch solves 
we note that the idea of a Counting Bloom filter could also be used to 
compute frequencies of elements in a data stream, however, it is not 
enough to build precise frequency estimators. 

Consider a data structure with an array $C = \{c_i\}_{i=1}^{m}$ of m counters and
p hash functions $h_1, h_2, \ldots, h_p$ that map from elements to $\{1, 2, \ldots, m\}$.
The indexing of element x from the data stream into such a data structure,
like with the Counting Bloom filter, includes computing $\{h_j(x)\}_{j=1}^{p}$ and
incrementing the corresponding counters $c_{h_j(x)}$, $j = 1 \ldots p$ in the array.

When we need to find a frequency f(x) of element x, we simply
compute the values of each hash function for that element and get
values of the corresponding counters $c_{h_1(x)}, c_{h_2(x)}, \ldots, c_{h_p(x)}$ that play
the role of frequency estimations.

However, because the counters are never decremented and the hash
functions use the same array, it is clear that such estimations will never
be smaller than the real frequency f(x) of the element:

$$f(x) \le c_{h_j(x)}, \quad j = 1 \ldots p,$$

the inequality is the result of possible hash collisions when different
elements update the same counters. In other words, the one-sided error
is common to our estimations, making them all upper bound estimates.
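
The one-sided error of such a counter array is easy to observe in a small Python sketch. The class CountingArray and the helper bucket are hypothetical names, the hash functions are assumed to be salted MD5 digests, and indices are zero-based here rather than starting from 1 as in the text.

```python
import hashlib
from collections import Counter


def bucket(x, j, m):
    """Assumed j-th hash function: salted MD5 reduced to {0, ..., m - 1}."""
    digest = hashlib.md5(f"{j}:{x}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % m


class CountingArray:
    """A single array of m counters shared by p hash functions."""

    def __init__(self, m, p):
        self.m, self.p = m, p
        self.counters = [0] * m

    def add(self, x):
        # counters are only ever incremented
        for j in range(self.p):
            self.counters[bucket(x, j, self.m)] += 1

    def estimates(self, x):
        # one frequency estimate per hash function
        return [self.counters[bucket(x, j, self.m)] for j in range(self.p)]


stream = [4, 4, 4, 4, 6, 2, 3, 5, 4, 4, 3, 3, 4, 2, 3, 3, 3, 2]
array = CountingArray(m=5, p=3)
for x in stream:
    array.add(x)

exact = Counter(stream)
for x in sorted(exact):
    print(x, exact[x], array.estimates(x))  # every estimate >= exact count
```

Whatever hash functions are chosen, each printed estimate is at least as large as the exact frequency, which is the upper bound behavior described above.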

The idea of Count Sketch is to solve this problem by building lower 
bound as well as the upper bound estimations. To prevent situations 
when collisions with high-frequency elements spoil most estimates of lower 
frequency elements, this requires a random decision when to decrease 
the counter and when to increment it. In order to reduce the variance it 
additionally takes the median of those estimations. 

The CountSketch data structure designed to store the frequencies
of m high frequency elements consists of a p × m array of counters $\{c_i^j\}$
that can be seen as an array of p hash tables, each of m buckets.
Additionally, it uses p hash functions $h_1, h_2, \ldots, h_p$ that map from
elements to $\{1, 2, \ldots, m\}$ and p hash functions $s_1, s_2, \ldots, s_p$ that map
from elements to $\{+1, -1\}$ in order to support both side approximation
to the real frequency value. It is assumed that hash functions $h_j$ and $s_j$
are pairwise independent and independent of each other.

The data structure allows counters to be updated for each indexed
element and estimates the number of times the element has been seen
in the past; this is used as the frequency estimation for the element.
Every time we index a new element x, the counters $c^j_{h_j(x)}$ for each
row j of the sketch can be either incremented or decremented, based on
the values of $s_j(x)$. Therefore, it is possible that the counters overestimate
the frequency of the element x, as well as underestimate it.

Algorithm 4.5: Updating the Count sketch
Input: Element x ∈ D
Input: Count sketch with p × m counters
for j ← 1 to p do
    i ← h_j(x)
    c^j_i ← c^j_i + s_j(x) · 1


Assuming that every hash function $\{h_j\}_{j=1}^{p}$ and $\{s_j\}_{j=1}^{p}$ can be
computed in constant time, the running time for the update procedure
given by Algorithm 4.5 is $O(p)$.
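
The update step of Algorithm 4.5 can be sketched in Python as follows. This is only a minimal illustration under stated assumptions: the helper names `bucket`, `sign` and `update` are hypothetical, buckets are 0-based, and salted BLAKE2b digests from the standard `hashlib` module stand in for the MurmurHash3, FNV1a and MD5 functions used in the examples.

```python
import hashlib


def _digest(x, row, purpose):
    # Assumption: a BLAKE2b digest salted with (row, purpose) plays the role
    # of an independent hash function per row; the book's examples use
    # MurmurHash3, FNV1a and MD5 instead.
    salt = bytes([row, purpose])
    raw = hashlib.blake2b(str(x).encode(), digest_size=8, salt=salt).digest()
    return int.from_bytes(raw, "big")


def bucket(x, row, m):
    """h_j(x): 0-based bucket index of x in row j."""
    return _digest(x, row, 0) % m


def sign(x, row):
    """s_j(x): direction of the update, +1 or -1."""
    return 1 if _digest(x, row, 1) % 2 == 0 else -1


def update(sketch, x):
    """Algorithm 4.5: touch exactly one counter per row, O(p) overall."""
    for j, row in enumerate(sketch):
        row[bucket(x, j, len(row))] += sign(x, j)


p, m = 3, 5
sketch = [[0] * m for _ in range(p)]
for _ in range(4):
    update(sketch, 4)  # index element 4 four times
```

Whatever the hash values turn out to be, every row now holds a single non-zero counter equal to +4 or -4, mirroring the first steps of Example 4.5 below.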

Example 4.5: Build Count sketch 

Consider a dataset of n = 18 elements: 


{4,4,4,4,2,3, 5,4, 6,4,3,3,4, 2,3,3,3, 2} 

and let’s build a CountSketch data structure of m = 5 counters using
p = 3 hash functions based on MurmurHash3, FNV1a and MD5 to decide
which counter to update:

$$
\begin{aligned}
h_1(x) &:= \mathrm{MurmurHash3}(x) \bmod 5 + 1, \\
h_2(x) &:= \mathrm{FNV1a}(x) \bmod 5 + 1, \\
h_3(x) &:= \mathrm{MD5}(x) \bmod 5 + 1,
\end{aligned}
$$

and three hash functions to determine the direction of the update:

$$
\begin{aligned}
s_1(x) &:= \mathrm{MurmurHash3}(x) \bmod 2 \;?\; -1 : 1, \\
s_2(x) &:= \mathrm{FNV1a}(x) \bmod 2 \;?\; -1 : 1, \\
s_3(x) &:= \mathrm{MD5}(x) \bmod 2 \;?\; -1 : 1.
\end{aligned}
$$

In the beginning, the CountSketch data structure consists of zeros:

          1    2    3    4    5
  h_1     0    0    0    0    0
  h_2     0    0    0    0    0
  h_3     0    0    0    0    0

We start processing elements from the dataset and the first element is 4.
According to Algorithm 4.5, we compute its hash values $h_1(4)$, $h_2(4)$, and
$h_3(4)$ to determine counters that have to be updated:

$$i_1 = h_1(4) = 3, \quad i_2 = h_2(4) = 3, \quad i_3 = h_3(4) = 1.$$

In this case, two hash functions deliver the same value, but since we
maintain dedicated lists of counters for each hash function this is not
a problem. To determine the direction of updates we compute hash values
$s_1(4)$, $s_2(4)$, and $s_3(4)$:

$$s_1(4) = 1, \quad s_2(4) = 1, \quad s_3(4) = -1.$$

Thus, we increment counters $c^1_3$ and $c^2_3$, while decrementing counter $c^3_1$.
The resulting CountSketch data structure becomes as follows.



          1    2    3    4    5
  h_1     0    0    1    0    0
  h_2     0    0    1    0    0
  h_3    -1    0    0    0    0


The next three elements are 4 too, hence we increment or decrement 
the same counters three more times: 



          1    2    3    4    5
  h_1     0    0    4    0    0
  h_2     0    0    4    0    0
  h_3    -4    0    0    0    0


The next element in the dataset is 2 and its corresponding indices are $i_1 = 3$,
$i_2 = 2$, and $i_3 = 3$. The values of direction hash functions are $s_1(2) = 1$,
$s_2(2) = 1$, and $s_3(2) = -1$, so we increment counters $c^1_3$ and $c^2_2$, and
decrement $c^3_3$. Note that there is a soft collision and element 2 changes (in
the same direction) the counter used by element 4. This makes the value
in the counter $c^1_3$ overestimate the real value for both elements.



          1    2    3    4    5
  h_1     0    0    5    0    0
  h_2     0    1    4    0    0
  h_3    -4    0   -1    0    0

In the same way, we process all remaining elements. For element 3 we
decrement counters $c^1_1$ and $c^2_3$, and increment counter $c^3_4$; for element 5
we decrement counters $c^1_3$ and $c^2_4$, and increment $c^3_4$; for element 6 we
decrement counters $c^1_4$ and $c^3_3$, and increment $c^2_1$.

The final CountSketch has the following form:

          1    2    3    4    5
  h_1    -6    0    9   -1    0
  h_2     1    3    1   -1    0
  h_3    -7    0   -4    7    0

It is known in probability theory that the usual procedure for building
better approximations from a number of randomly distributed trials is
to use the mean and median. The Count Sketch algorithm, to compute
the final estimation of the frequency, uses the median because it is robust
and less sensitive to outliers.


Algorithm 4.6: Estimating frequency with the Count Sketch
Input: Element x ∈ D
Input: Count sketch with p × m counters
Output: Frequency estimation
f := {f_j}, j = 1, ..., p
for j ← 1 to p do
    i ← h_j(x)
    f_j ← s_j(x) · c^j_i
return median(f_1, f_2, ..., f_p)


Computing the p per-row estimates takes $O(p)$ time, and finding the median
of p elements takes linear time using one of the selection algorithms;
therefore, the overall query time is $O(p)$.

Example 4.6: Frequency estimation with Count Sketch

Consider the data structure that we built in Example 4.5:

          1    2    3    4    5
  h_1    -6    0    9   -1    0
  h_2     1    3    1   -1    0
  h_3    -7    0   -4    7    0

Let’s estimate the frequency of element 4, whose corresponding counters are
$c^1_3$, $c^2_3$, and $c^3_1$; the update directions are $s_1(4) = 1$, $s_2(4) = 1$, and
$s_3(4) = -1$, as we determined earlier. Using Algorithm 4.6, as the estimation we
calculate the median of weighted values of those counters:

$$f = \mathrm{median}(s_1(4) \cdot c^1_3,\; s_2(4) \cdot c^2_3,\; s_3(4) \cdot c^3_1) = \mathrm{median}(9, 1, 7) = 7.$$

Thus, the estimated frequency of element 4 is 7, which is also the correct
count from the dataset.

Now, consider element 2 with the corresponding counters $c^1_3$, $c^2_2$, and $c^3_3$
with values of direction hash functions $s_1(2) = 1$, $s_2(2) = 1$, and $s_3(2) = -1$.

Thus, the frequency estimation for element 2 is

$$f = \mathrm{median}(s_1(2) \cdot c^1_3,\; s_2(2) \cdot c^2_2,\; s_3(2) \cdot c^3_3) = \mathrm{median}(9, 3, 4) = 4,$$

which overestimates the real value 3.
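
The numbers in Examples 4.5 and 4.6 can be re-checked with a short Python sketch. Instead of computing MurmurHash3, FNV1a and MD5, it tabulates their values in the hypothetical lookup tables `H` and `S`. The entries for elements 4 and 2 are stated in the examples; the entries for elements 3, 5 and 6 are an assumption, inferred from the counter values shown in the tables.

```python
from statistics import median

# Tabulated hash values (1-based bucket per row) and update directions.
# Entries for 3, 5 and 6 are inferred from the example tables (assumption).
H = {4: (3, 3, 1), 2: (3, 2, 3), 3: (1, 3, 4), 5: (3, 4, 4), 6: (4, 1, 3)}
S = {4: (1, 1, -1), 2: (1, 1, -1), 3: (-1, -1, 1), 5: (-1, -1, 1), 6: (-1, 1, -1)}
P, M = 3, 5  # rows (hash functions) and counters per row


def update(sketch, x):
    """Algorithm 4.5 with tabulated hash values."""
    for j in range(P):
        sketch[j][H[x][j] - 1] += S[x][j]


def estimate(sketch, x):
    """Algorithm 4.6: median of the signed per-row counters."""
    return median(S[x][j] * sketch[j][H[x][j] - 1] for j in range(P))


data = [4, 4, 4, 4, 2, 3, 5, 4, 6, 4, 3, 3, 4, 2, 3, 3, 3, 2]
sketch = [[0] * M for _ in range(P)]
for x in data:
    update(sketch, x)

f4 = estimate(sketch, 4)  # median(9, 1, 7)
f2 = estimate(sketch, 2)  # median(9, 3, 4)
```

With these tables the sketch ends in the state shown above, and the two estimates agree with the example: 7 for element 4 (exact) and 4 for element 2 (an overestimate of the true count 3).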


The Count Sketch algorithm can be used to find the top k most
frequent elements, known as the Frequent problem. In a single pass
through the data stream, in addition to the regular $p \times m$ array of counters
and the hash functions $\{h_j\}_{j=1}^{p}$ and $\{s_j\}_{j=1}^{p}$, we maintain a set X* of k
high-frequency elements. We first index every element x from the data
stream into the CountSketch data structure according to Algorithm 4.5.
Then, if the element is not in the set X* and there is capacity to add it, we
insert the element. Otherwise, we estimate its frequency with Algorithm 4.6
and, if it is greater than the smallest frequency in the set, we add element
x to X* while removing the element with the smallest frequency.

Algorithm 4.7: Getting frequent elements with the Count Sketch
Input: Data stream D
Input: Count sketch with p × m counters
Output: Top frequent elements
X* ← ∅
for x ∈ D do
    Update(x)
    if x ∈ X* then
        continue
    if |X*| < k then
        X* ← X* ∪ {x}
    else
        f ← Frequency(x)
        (x_min, f_min) ← min over x* ∈ X* of Frequency(x*)
        if f > f_min then
            X* ← X* ∪ {x} \ {x_min}
return X*


Example 4.7: Most frequent elements with Count Sketch 
Consider the same setup as in Example 4.5 and search for k = 3 most 
frequent elements in the dataset: 


{4,4,4,4,2,3,5,4,6,4,3,3,4,2,3,3,3,2}.

According to Algorithm 4.7, in addition to the CountSketch data
structure, we create a set X* to store frequent candidates.

We start consuming the dataset and the first element is 4, so, as we
know from Example 4.5, we need to increment counters $c^1_3$ and $c^2_3$, while
decrementing counter $c^3_1$.

The set X* is empty, thus we freely insert element 4 into it: X* = [4].

The next three elements in the data stream are also equal to 4, so we index
them into the data structure without any changes to X*.

The next element is 2 and we increment counters $c^1_3$ and $c^2_2$, and decrement
$c^3_3$, as we determined earlier. This element is not in the set of the most
frequent candidates and since X* has enough capacity, we add element 2
into the set: X* = [4, 2].

The next input element is 3. To index it into the CountSketch data
structure we decrement counters $c^1_1$ and $c^2_3$, and increment counter $c^3_4$.
Since the set X* contains only two elements out of three possible, we add
element 3 into the set: X* = [4, 2, 3].

Next, we take element 5 from the dataset and update the sketch by
decrementing the counters $c^1_3$ and $c^2_4$, and incrementing $c^3_4$.

Element 5 is not in the set X*, which has reached its maximum capacity of
k = 3 monitored elements. Thus, we need to estimate the frequencies of
elements in the set and of element 5 using Algorithm 4.6 for the current
CountSketch data structure.

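$$
\begin{aligned}
f(5) &= \mathrm{median}(-c^1_3,\; -c^2_4,\; c^3_4) = \mathrm{median}(-4, 1, 2) = 1, \\
f(4) &= \mathrm{median}(c^1_3,\; c^2_3,\; -c^3_1) = \mathrm{median}(4, 3, 4) = 4, \\
f(2) &= \mathrm{median}(c^1_3,\; c^2_2,\; -c^3_3) = \mathrm{median}(4, 1, 1) = 1, \\
f(3) &= \mathrm{median}(-c^1_1,\; -c^2_3,\; c^3_4) = \mathrm{median}(1, -3, 2) = 1.
\end{aligned}
$$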

Therefore, the estimated frequency of the current element 5 doesn’t exceed 
the minimum frequency of elements in the set, so we don’t change the set 
of monitored elements: X* = [4, 2, 3]. 

In a similar manner we handle all remaining elements from the dataset
and after processing the last one, the CountSketch data structure has
the following form:

          1    2    3    4    5
  h_1    -6    0    9   -1    0
  h_2     1    3    1   -1    0
  h_3    -7    0   -4    7    0

and the set of the three most frequent elements is

X* = [4, 2, 3].

Note, that the most frequent elements in X* are not ordered and to 
estimate their frequencies we can use Algorithm 4.6. 
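
Algorithm 4.7 can be traced on the same dataset with the Python sketch below. As before, the hash functions are replaced by the hypothetical lookup tables `H` and `S` (values for elements 3, 5 and 6 are inferred from the example tables, which is an assumption), and the function name `top_k` is illustrative only.

```python
from statistics import median

# Tabulated hash values (1-based bucket per row) and update directions;
# entries for 3, 5 and 6 are inferred from the example tables (assumption).
H = {4: (3, 3, 1), 2: (3, 2, 3), 3: (1, 3, 4), 5: (3, 4, 4), 6: (4, 1, 3)}
S = {4: (1, 1, -1), 2: (1, 1, -1), 3: (-1, -1, 1), 5: (-1, -1, 1), 6: (-1, 1, -1)}
P, M = 3, 5


def top_k(stream, k):
    """Algorithm 4.7: keep a set of at most k frequent candidates."""
    sketch = [[0] * M for _ in range(P)]

    def update(x):
        for j in range(P):
            sketch[j][H[x][j] - 1] += S[x][j]

    def frequency(x):
        return median(S[x][j] * sketch[j][H[x][j] - 1] for j in range(P))

    candidates = set()
    for x in stream:
        update(x)
        if x in candidates:
            continue
        if len(candidates) < k:
            candidates.add(x)
        else:
            # Replace the weakest candidate only if x is strictly more frequent.
            x_min = min(candidates, key=frequency)
            if frequency(x) > frequency(x_min):
                candidates.remove(x_min)
                candidates.add(x)
    return candidates


data = [4, 4, 4, 4, 2, 3, 5, 4, 6, 4, 3, 3, 4, 2, 3, 3, 3, 2]
result = top_k(data, k=3)
```

Elements 5 and 6 each reach an estimate of 1 when they arrive, which does not exceed the smallest estimate in the set at that moment, so the candidates 4, 2 and 3 are never replaced.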


In the same way we can address the Heavy hitters problem. To find
k heavy hitters we maintain a counter N of already processed elements,
which we use to calculate the frequency threshold $f^* = N/k$ every time a
new element is indexed. If the estimated frequency of the current element
is above the threshold, we insert it into the heap X* as a candidate
for heavy hitters. Additionally, on every step we remove elements from
the heap whose stored frequency falls below the current threshold $f^*$.

Algorithm 4.8: Determining heavy hitters with the Count Sketch
Input: Data stream D
Input: Count sketch with p × m counters
Output: Heavy hitters
N ← 0, X* ← ∅
for x ∈ D do
    N ← N + 1
    Update(x)
    f ← Frequency(x)
    f* ← N / k
    if f > f* then
        X* ← X* ∪ {(x, f)}
    for (x*, f) ∈ X* do
        if f < f* then
            X* ← X* \ {(x*, f)}
return X*

The Count Sketch algorithm can also be applied to finding elements 
with the largest frequency change, otherwise known as the Max-Change 
Problem. Having data streams of two comparable periods, we can build 
a CountSketch data structure for each of them and maintain the heap 
X* of elements with the largest differences. Every time new elements 
are indexed, we estimate their frequencies using Algorithm 4.6 and 
update the heap to keep only elements with the most change. Finally, 
the algorithm outputs k elements with the largest values of frequency 
change. 


Properties 

The Count Sketch provides the guarantee that the estimation error for
frequencies is not bigger than $\varepsilon \cdot n$ with probability at least $1 - \delta$.
Increasing the number of hash functions p decreases the probability of
a bad estimate, and for the desired standard error $\delta$ the recommendation
on the number of hash functions, which correspond to the rows in
the CountSketch, is



(4.1) 


The bigger m, the less likely it is that collisions will happen, meaning
a lower estimation error $\varepsilon \cdot n$. At the same time, with bigger p more
estimators are used to calculate the final value, which makes it more
reliable. The recommendation on the number of counters m is


$$m \approx \left\lceil \frac{2.71828}{\varepsilon^2} \right\rceil. \qquad (4.2)$$


The overall space required by the Count Sketch data structure is
$O(m \cdot p + 2p)$, because we keep a count matrix sized $p \times m$ and two hash
functions per row.

If two Count Sketch data structures have the same size m, they can
be easily added to and subtracted from each other, which is useful for
distributed stream processing.

There are implementations of Count Sketch for Apache Hive and
other data warehouse software, but modern applications prefer to use
its successor, the Count-Min Sketch algorithm, because it requires less
space and execution time.


4.4 Count-Min Sketch

Count-Min Sketch is a simple space-efficient probabilistic data structure 
that is used to estimate frequencies of elements in data streams and can 
address the Heavy hitters problem. It was presented in 2003 [Co03] by 
Graham Cormode and Shan Muthukrishnan and published in 2005 [Co05]. 

As we saw in the previous section for Count Sketch, the main obstacle 
to the direct application of the Counting Bloom filter in frequency 
estimation tasks is that it shares a single array of counters for all hash 
functions and, consequently, suffers from hard and soft collisions. 





The quality of the estimation is hardly affected by the probability of 
hash collisions even though they lead to overestimations for counters. 
However, when the number of elements in the data stream is huge,
collisions with high-frequency elements are almost certain and this
makes such an approximation useless due to the large overestimation of
all counters.

By treating this problem as a lack of highly confident estimates to
compute frequency with sufficient precision, the Count-Min Sketch
algorithm replaces the single array of m counters with a hash table of p
arrays of m counters and, instead of updating each counter by every
element, lets the elements update different subsets of counters, one per
hash table. The purpose of m is to compress the data stream
$D = \{x_1, x_2, \ldots, x_n\}$ and because $m \ll n$ this is a “lossy” compression
that leads to errors. To reduce these errors, the algorithm introduces
many independent trials by using p hash functions with a dedicated
array of m counters for each.

The CMSketch is a space-efficient data structure that consists of
a $p \times m$ array of counters $\{c^j_i\}$, where p pairwise independent hash
functions $h_1, h_2, \ldots, h_p$ map the universe to the range $\{1, 2, \ldots, m\}$.

Such a simple data structure supports indexing elements from
the data stream by updating counters, and can report the number
of times a particular element has been indexed, which can be seen
as the frequency estimation for the element.

Algorithm 4.9: Updating the Count-Min sketch
Input: Element x ∈ D
Input: Count-Min sketch with p × m counters
for j ← 1 to p do
    i ← h_j(x)
    c^j_i ← c^j_i + 1


Assuming that every hash function $\{h_j\}_{j=1}^{p}$ can be computed in
constant time, the running time for the update procedure given by
Algorithm 4.9 is $O(p)$.

Example 4.8: Build Count-Min sketch

Consider the dataset of n = 18 elements from Example 4.5:

{4,4,4,4,2,3,5,4,6,4,3,3,4,2,3,3,3,2}

and let’s build a Count-Min sketch of m = 4 counters using p = 2 hash
functions based on MurmurHash3 and FNV1a:

$$
\begin{aligned}
h_1(x) &:= \mathrm{MurmurHash3}(x) \bmod 4 + 1, \\
h_2(x) &:= \mathrm{FNV1a}(x) \bmod 4 + 1.
\end{aligned}
$$

In the beginning, the CMSketch data structure consists of zeros:

          1    2    3    4
  h_1     0    0    0    0
  h_2     0    0    0    0
We start consuming elements from the dataset. The first element is 4
and, according to Algorithm 4.9, we compute its hash values to determine
counters that have to be updated:

$$i_1 = h_1(4) = 4, \quad i_2 = h_2(4) = 4.$$

Note that both hash functions deliver the same value, but since we maintain
dedicated arrays of counters for each hash function this is not a problem.
Thus, we increment counters $c^1_4$ and $c^2_4$ and the CMSketch is as below.

          1    2    3    4
  h_1     0    0    0    1
  h_2     0    0    0    1

The next three elements are all equal to 4, hence we update the same
counters:

          1    2    3    4
  h_1     0    0    0    4
  h_2     0    0    0    4

The next element in the dataset is 2 and its corresponding indices are $i_1 = 4$
and $i_2 = 1$, so we increment counters $c^1_4$ and $c^2_1$. Note that there is a soft
collision and element 2 changes the counter that is also used by element
4. This makes the value in the counter $c^1_4$ overestimate the real value for
both elements.

          1    2    3    4
  h_1     0    0    0    5
  h_2     1    0    0    4

In the same way, we process all remaining elements and update counters 
c\ and c| for element 3, counters c\ and C 2 for element 5, and c\ and cf 
for element 6. Note, that both counters for element 6 collide with other 
elements, thus we can expect that its value will be overestimated. 

The final CMSketch data structure has the following form: 


h 

h 2 


Every time element x is indexed, the same counters $c^j_{h_j(x)}$ are
incremented for each row j of the sketch and since they are never
decremented, those counters provide the upper bound for
the frequencies:

$$f(x) \le c^j_{h_j(x)}, \quad j = 1, 2, \ldots, p.$$

While counters cannot underestimate the real frequency $f(x)$, they
generally overestimate it because $m \ll n$ and there are a lot of collisions
such that $h_j(x) = h_j(y)$ for $x \ne y$, meaning that when element y is
indexed into CMSketch the counter for element x is also incremented.

As a result, there are p estimations that suffer from a one-sided error
(all of them are overestimations of the real value). The usual procedure
for building a better approximation from a number of estimations is
averaging, but this error can make the estimation even worse. Obviously,
the best estimation in this case is the smallest one.

Algorithm 4.10: Estimating frequency with Count-Min Sketch
Input: Element x ∈ D
Input: Count-Min sketch with p × m counters
Output: Frequency estimation
f := {f_j}, j = 1, ..., p
for j ← 1 to p do
    i ← h_j(x)
    f_j ← c^j_i
return min(f_1, f_2, ..., f_p)


The minimum of p elements can be found in linear time, and therefore
the running time of the frequency estimation procedure given by
Algorithm 4.10 is $O(p)$, the same as for an update.
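
Algorithms 4.9 and 4.10 fit into a small Python class. The sketch below is illustrative rather than a reference implementation: the class name `CountMinSketch` and its methods are hypothetical, buckets are 0-based, and salted BLAKE2b digests from `hashlib` are an assumed replacement for the MurmurHash3 and FNV1a functions used in the examples, so the concrete counter values differ from those in the text.

```python
import hashlib
from collections import Counter


class CountMinSketch:
    """p rows of m counters; update and query both cost O(p)."""

    def __init__(self, p, m):
        self.p, self.m = p, m
        self.counters = [[0] * m for _ in range(p)]

    def _index(self, x, j):
        # Assumption: BLAKE2b salted with the row number acts as h_j.
        raw = hashlib.blake2b(str(x).encode(), digest_size=8,
                              salt=bytes([j])).digest()
        return int.from_bytes(raw, "big") % self.m

    def update(self, x):
        """Algorithm 4.9: increment one counter in every row."""
        for j in range(self.p):
            self.counters[j][self._index(x, j)] += 1

    def estimate(self, x):
        """Algorithm 4.10: the smallest counter is the tightest upper bound."""
        return min(self.counters[j][self._index(x, j)] for j in range(self.p))


data = [4, 4, 4, 4, 2, 3, 5, 4, 6, 4, 3, 3, 4, 2, 3, 3, 3, 2]
cms = CountMinSketch(p=2, m=4)
for x in data:
    cms.update(x)

exact = Counter(data)
errors = {x: cms.estimate(x) - exact[x] for x in exact}  # never negative
```

Because counters are only ever incremented, every entry of `errors` is zero or positive, whichever hash functions are plugged in.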


Example 4.9: Frequency estimation with the Count-Min Sketch 

Consider the CMSketch data structure that we built in Example 4.8:


h 

h 2 

Let’s estimate the frequency of element 4, whose corresponding counters
are $c^1_4$ and $c^2_4$, as we determined earlier. Using Algorithm 4.10, as
the estimation we calculate the minimum of those counters:

$$f = \min(c^1_4, c^2_4) = \min(10, 7) = 7.$$

Thus, the estimated frequency of element 4 is 7, which is also the correct
count from the dataset.

Now, consider element 6 with corresponding counters c\ and cf. However, 
as we already noted in Example 4.8, both of them are also used by other
elements due to collisions. Thus, the frequency estimation for element 6 is 

/ = min(ci, cf) = min(8,4) = 4, 

which significantly overestimates the real value of 1. If we want to maintain
better accuracy and make such collisions rare, we need to have more hash
functions and counters, which increases the computational time and storage.

Knowing how to estimate the frequencies of elements lets the Count-Min
Sketch algorithm determine the most frequent elements. Similar to
the Count Sketch, the simplest approach requires the maintenance of
a set X* of candidates for the most frequent elements as well as the main
CMSketch data structure. Then, we go over the data stream and
update the sketch with all the elements seen thus far. If the element is
not in the set and the set still contains fewer than k elements, we simply add
it. However, if the set is at its maximal capacity, we add the current
element only if its estimated frequency exceeds the minimum frequency in
the set, replacing the element with the smallest frequency. In the end,
elements in X* are considered the most frequent elements in the data
stream.

In a similar way, the CMSketch data structure can address the Heavy
hitters problem described earlier. In a single pass through the data
stream, in addition to the regular $p \times m$ array of counters C and p hash
functions, we allocate a single counter N that stores the number of
elements seen thus far, and maintain a heap X* of up to k potential
heavy hitters. We use the frequency threshold $f^* = N/k$ to decide whether
an element is a heavy hitter. For every element x in the data stream, we
execute the update procedure followed by the frequency estimation, and
if $f(x) > f^*$, the element qualifies as a heavy hitter candidate. If
the element is not in the heap yet, we store it together with its frequency;
otherwise, we update the stored frequency with the new value.

The counter N increases with every processed element. As it grows,
the estimated frequencies of some elements in the heap become less than
$f^*$, and these elements must be removed from the heap at every step. At
the end of the processing, all elements in the heap are considered heavy
hitters. According to the definition, there are at most k heavy hitters in
the data stream.

Algorithm 4.11: Getting heavy hitters with the Count-Min Sketch
Input: Data stream D
Input: Count-Min sketch with p × m counters
Output: Heavy hitters
N ← 0, X* ← ∅
for x ∈ D do
    N ← N + 1
    Update(x)
    f ← Frequency(x)
    f* ← N / k
    if f > f* then
        X* ← X* ∪ {(x, f)}
    for (x*, f) ∈ X* do
        if f < f* then
            X* ← X* \ {(x*, f)}
return X*


Maintaining a heap for the ε-Heavy hitters problem with $\varepsilon = \frac{1}{k}$
requires $O(\log \frac{1}{\varepsilon})$ additional work per element.


Example 4.10: Heavy hitters with Count-Min Sketch 

Consider the same setup as in Example 4.9 and search for k = 3 heavy
hitters while processing the dataset

{4,4,4,4,2,3,5,4,6,4,3,3,4,2,3,3,3,2}.

According to Algorithm 4.11, in addition to the CMSketch data structure
we create a counter N of processed elements and a heap X* that stores
up to k heavy hitter candidates. We will skip the details of updating
counters and frequency estimation because these steps are the same as in
the examples above.

We start consuming the dataset and the first element is 4, so we increment
the corresponding counters $c^1_4$ and $c^2_4$.

          1    2    3    4
  h_1     0    0    0    1
  h_2     0    0    0    1

At this point, we have processed N = 1 element, so the threshold $f^*$ for
the heap X* is $\frac{1}{3}$. The frequency estimation for element 4 from CMSketch
is 1 and this is above the threshold, hence we add this element and its
frequency to the heap: X* = [(4, 1)].

The next element is 4 again and we increment the same counters.

          1    2    3    4
  h_1     0    0    0    2
  h_2     0    0    0    2

However, since we have already processed N = 2 elements, the threshold $f^*$ is
changed to $\frac{2}{3}$. The current frequency estimation for element 4 is 2, which
is still above the threshold, and since the element is already in the heap we
just update its frequency: X* = [(4, 2)].

In a similar manner we process the next 14 elements (up to N = 16).
There are no changes to the number of elements in the heap and element
4 is the only heavy hitter candidate so far: X* = [(4, 7)]. The CMSketch
data structure has the following form:


h 

ha 

Next element in the dataset is 3, whose counters c| and cf we increment. 


h 

h2 

At this moment, there are N = 17 processed elements, so the frequency
threshold is $f^* = \frac{17}{3} \approx 5.67$. The estimated frequency of element 3 is
$f = \min(7, 6) = 6$, which is above the threshold, therefore we add it to
the heap: X* = [(4, 7), (3, 6)]. All elements in the heap have large enough
frequencies, hence we don’t remove any of them.

The last element in the dataset is 2, whose frequency is below the threshold
$f^* = \frac{18}{3} = 6$. Therefore, there are no changes to the heap at this step and
the final list of heavy hitters is

X* = [(4, 7), (3, 6)].
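
A minimal Python sketch of Algorithm 4.11 is given below. It is written under a few assumptions: the function names are hypothetical, a plain dictionary stands in for the heap X*, salted BLAKE2b digests from `hashlib` replace the hash functions of the example, and the default sketch size (`p=5`, `m=64`) is chosen arbitrarily.

```python
import hashlib


def _index(x, j, m):
    # Assumption: BLAKE2b salted with the row number acts as h_j.
    raw = hashlib.blake2b(str(x).encode(), digest_size=8,
                          salt=bytes([j])).digest()
    return int.from_bytes(raw, "big") % m


def heavy_hitters(stream, k, p=5, m=64):
    """Algorithm 4.11: candidates whose estimate stays above N / k."""
    sketch = [[0] * m for _ in range(p)]
    n = 0
    candidates = {}  # element -> stored frequency estimate
    for x in stream:
        n += 1
        cells = [(j, _index(x, j, m)) for j in range(p)]
        for j, i in cells:                         # Update(x)
            sketch[j][i] += 1
        f = min(sketch[j][i] for j, i in cells)    # Frequency(x)
        threshold = n / k
        if f > threshold:
            candidates[x] = f
        # Drop candidates whose stored frequency fell below the threshold.
        for y in [y for y, fy in candidates.items() if fy < threshold]:
            del candidates[y]
    return candidates


data = [4, 4, 4, 4, 2, 3, 5, 4, 6, 4, 3, 3, 4, 2, 3, 3, 3, 2]
result = heavy_hitters(data, k=3)
```

Since Count-Min estimates never fall below the true counts, elements 4 and 3 end up in the result for any choice of hash functions; without collisions the stored frequencies are 7 and 6, as in the example.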


Properties 


The Count-Min Sketch is approximate and probabilistic at the same
time, therefore two parameters, the error $\varepsilon$ in answering the particular
query and the error probability $\delta$, affect the space and time requirements.
In fact, it provides the guarantee that the estimation error for frequencies
will not exceed $\varepsilon \cdot n$ with probability at least $1 - \delta$.

Similar to the Count Sketch, increasing the number of hash functions
p decreases the probability of a bad estimate. For the desired standard
error $\delta$, the recommendation for the number of hash functions that
correspond to the rows in the CMSketch data structure is

$$p = \left\lceil \ln \frac{1}{\delta} \right\rceil. \qquad (4.3)$$


The bigger m, the less likely collisions will happen, thus
the overestimation error $\varepsilon \cdot n$ will be lower. At the same time,
with bigger p more estimations are used to calculate the final minimal
value, which makes it more reliable. Thus, the recommendation on
the number of counters m is

$$m \approx \left\lceil \frac{2.71828}{\varepsilon} \right\rceil$$

and comparison with (4.2) shows that the Count-Min Sketch is more
space-friendly than the Count Sketch.

Since the CMSketch data structure consists of a two-dimensional
array sized $p \times m$ and uses p hash functions, it requires $O(m \cdot p + p)$
space, assuming that every hash function is stored in $O(1)$ space.

Example 4.11: Estimate required space

According to condition (4.3), to have the standard error $\delta$ around 1%,
at least $p = \lceil \ln \frac{1}{0.01} \rceil = 5$ hash functions are required. For instance, we
expect 10 million ($n = 10^7$) elements to be indexed and allow the fixed
overestimate of 10. Thus, we need $\varepsilon = \frac{10}{10^7} = 10^{-6}$ and the recommended
number of counters is

$$m = \left\lceil \frac{2.71828}{10^{-6}} \right\rceil = 2718280.$$

Thus, the CMSketch data structure needs to keep the counter array sized
5 × 2718280 and, having 32-bit integer counters, the whole data structure
requires 54.4 MB of memory.
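
The same sizing arithmetic can be scripted. The Python sketch below uses a hypothetical helper `cms_dimensions`; it takes Euler's number from `math.e` rather than the rounded constant 2.71828, so it recommends 2718282 counters instead of 2718280, and the memory estimate still rounds to 54.4 MB.

```python
import math


def cms_dimensions(delta, epsilon):
    """Rows p = ceil(ln(1/delta)) and counters per row m = ceil(e/epsilon)."""
    p = math.ceil(math.log(1 / delta))
    m = math.ceil(math.e / epsilon)
    return p, m


n = 10**7                      # expected number of indexed elements
allowed_overestimate = 10      # fixed overestimate we tolerate
p, m = cms_dimensions(delta=0.01, epsilon=allowed_overestimate / n)

bytes_per_counter = 4          # 32-bit integer counters
size_mb = p * m * bytes_per_counter / 1e6
```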


Two Count-Min sketches of the same size can be easily merged together
by simple matrix addition resulting in a data structure for the union of
their datasets. As a result, the Count-Min Sketch is useful in MapReduce
and parallel streaming tasks for Big Data applications.
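
Merging by matrix addition can be illustrated with a few lines of Python. The sketch assumes that both sketches share the same dimensions and the same hash functions; the helper names `build` and `merge` are hypothetical, and salted BLAKE2b digests again stand in for the hash functions.

```python
import hashlib


def _index(x, j, m):
    # Assumption: BLAKE2b salted with the row number acts as h_j.
    raw = hashlib.blake2b(str(x).encode(), digest_size=8,
                          salt=bytes([j])).digest()
    return int.from_bytes(raw, "big") % m


def build(stream, p=3, m=16):
    """Index a whole stream into a fresh p x m Count-Min sketch."""
    sketch = [[0] * m for _ in range(p)]
    for x in stream:
        for j in range(p):
            sketch[j][_index(x, j, m)] += 1
    return sketch


def merge(a, b):
    """Element-wise sum of two sketches with identical shape and hashes."""
    return [[ca + cb for ca, cb in zip(row_a, row_b)]
            for row_a, row_b in zip(a, b)]


data = [4, 4, 4, 4, 2, 3, 5, 4, 6, 4, 3, 3, 4, 2, 3, 3, 3, 2]
left, right = build(data[:9]), build(data[9:])   # e.g. two worker nodes
merged = merge(left, right)
whole = build(data)                              # single-node reference
```

The merged sketch is identical to the one built over the whole stream, which is why partial sketches computed on separate workers can be combined without revisiting the data.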


Big data is characterized by a large amount of data that comes at high speed, 
which makes space and update time significant. Fortunately, the practical 
implementations of the Count-Min Sketch consume only up to a few 
hundreds of megabytes of memory and can handle dozens of millions of 
updates per second. 

The Count-Min Sketch is widely used for tasks on traffic analysis and
in-stream mining applications running on distributed stream processing
frameworks including Apache Spark, Apache Storm, Apache Flink, and
others. There are also implementations for popular databases such as
Redis and PostgreSQL.


Conclusion 

In this chapter we discussed the problem of determining frequencies of
elements in continuous and potentially infinite streams, which often have
to be processed by Big Data applications. We started with formulating
many important frequency-related problems that can be solved using
the data structures and algorithms from this chapter. Starting with
the very simple problem of the majority element, we moved on to learning
how to solve the very complex problems of finding the most frequent
elements and heavy hitters.

If you are interested in more information about the material covered 
here or want to read the original papers, please take a look at the list of 
references that follows this chapter. 

In the next chapter, we continue working with data streams and 
consider probabilistic algorithms that can be employed to compute rank 
characteristics such as quantiles and their particular types including 
percentiles and quartiles. 

5

Rank 


Large volumes of unstructured data easily overwhelm the human ability
to understand them, which makes data summarization by computing statistical
quantities one of the most necessary tasks to perform with data. In
this chapter, we investigate algorithms and data structures to calculate
rank-based characteristics of the data using a small amount of memory
and one pass through the data.

The most commonly used rank characteristics are quantiles. Formally,
the q-quantile ($0 \le q \le 1$) is an element of the sequence where a q fraction
of elements from the sequence are less than or equal to it, and the remaining
$(1 - q)$ are greater than or equal to it. Moreover, if the sequence has n elements, we
say that a q-quantile element is an element of the sequence whose rank is
$q \cdot n$. Percentiles are just quantiles that divide the sorted sequence into
100 equal parts, hence the 95th percentile is the same as the 0.95-quantile.
The 0- and 1-quantiles are the minimal and the maximal elements in
the sequence, respectively. The 0.5-quantile is known as the median.


As was proven by Ian Munro and Michael Paterson 1 , finding a particular
quantile exactly in p passes through the data requires $O(n^{1/p})$ memory.
This means that no one-pass algorithm can guarantee to produce
the precise value of the quantile in sublinear space. This motivates a search
for algorithms that compute approximate quantiles.

In practice, having an error in a quantile calculation is often tolerable
because quantiles are usually estimated for noisy input data and approximate
the unknown data distributions. Thus, in most cases we are interested
in the ε-approximate q-quantile, meaning an element with its rank in
$[(q - \varepsilon) \cdot n, (q + \varepsilon) \cdot n]$, where n is the number of elements and $0 < \varepsilon < 1$
is an error parameter. Note that more than one element could qualify.

Estimation of various rank characteristics like quantile summaries plays
an important role in streaming outlier detection methods. For instance,
if we monitor online e-commerce transactions to detect credit card fraud,
we are interested in unusual payment locations, those that don’t fit in
the 99th percentile of the usual location distribution for our customers.


Example 5.1: Fraud detection (Perlich et al., 2007) 

Financial fraud remains one of the most critical issues facing the financial
industry. For instance, in 2015, global credit and debit card fraud resulted
in losses amounting to $21.84 billion 2 .

Many applications have been built to search for and identify the signs 
of financial fraud. Such applications frequently use numerous specific 
variables whose “degree of outlyingness” is examined for every observation. 
For instance, variables such as the total amount spent on a credit card
and the amount spent per day can be used.

For every observation, the degree of outlyingness can be approximated by
the quantiles of some spending distribution. Thus, the suspected fraudulent
observations can be identified as outliers through comparison to some high
quantiles, e.g., the 0.95-quantile.


1 Selection and sorting with limited storage, Theoretical Computer Science, Vol. 12 (1980)


Another huge application domain of rank summaries is web traffic
monitoring. Investigating the summaries means problems can be detected
early, without inspecting the actual data.


Example 5.2: Website monitoring (Buragohain & Suri, 2009)

Big websites handle millions of users every single day. For instance, in
September 2017 Wikipedia processed about 500 million hits per day 3
across all its languages, that is roughly 5.7 thousand requests per second,
using more than 300 servers around the globe.

One of the most critical issues in a website’s performance is latency,
the delay between when the content was created and the time it was
transferred to the visitor. Since the distribution of the latency values
is typically skewed, the monitoring is usually built by tracking some
particular high quantiles or percentiles. The most common questions are:

• What is the latency for 95% of requests for a single web server? 

• What is the latency for 99% of requests for the entire website? 

• What was the latency for 95% of requests for the entire website in 
the last 15 minutes? 

While all these questions can be answered with a quantile computation,
technically they have differences that might require the application of
different methods. For instance, while for the first question a summary
can be computed per single stream, the second question requires distributed
algorithms that can compute statistics on the data of many streams. In
contrast, the third question requires only a subset of the stream's data
defined by a time window, and such a subset will always change.


2 Credit Card & Debit Card Fraud Statistics https://wallethub.com/edu/statistics/25725/

The task to find the $q$-quantile, or, in other words, the element from
a sorted sequence of $n$ elements whose rank is $q \cdot n$, where
$q \in (0, 1)$, is called a Quantile query. The Median query is a special
case of the quantile query with $q = 0.5$.

The problem of quantile calculation is not new and is already well
developed in classical computation. However, it poses new challenges for
unbounded streams, which are common for Big Data applications, when
limited memory is available and only a single pass through the data is
possible. The Count-Min Sketch algorithm, previously introduced in
Chapter 4, allows for the computation of approximate values of
quantiles but requires much more memory than the algorithms that will
be discussed in this chapter.

Alternatively, we can search for the rank of the given element $x$ in
a sorted sequence of $n$ elements, known as an Inverse quantile query.
With $rank(x)$ and the total number of elements $n$, it is easy to compute
the corresponding quantile $q$:

$$q = \frac{1}{n} \cdot rank(x).$$

For many applications, it is also important to find the number of 
elements from a sorted sequence of n elements that are in some given 
range [a, b], often referred to as a Range query. In fact, to calculate such 
a number, it is enough to compute the ranks of the range’s boundaries 
and return their difference. 
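
As an illustration of the three queries on a dataset small enough to be
sorted in memory, consider the following minimal Python sketch. The function
names are hypothetical, the rank of an element is assumed to be the number of
elements smaller than it, and the range boundaries are assumed to be
inclusive.

```python
from bisect import bisect_left, bisect_right


def exact_rank(sorted_data, x):
    """Inverse quantile query: number of elements smaller than x."""
    return bisect_left(sorted_data, x)


def exact_quantile(sorted_data, q):
    """Quantile query: element whose rank is about q * n, for q in (0, 1)."""
    n = len(sorted_data)
    return sorted_data[min(n - 1, int(q * n))]


def exact_range_count(sorted_data, a, b):
    """Range query: number of elements in the inclusive range [a, b]."""
    return bisect_right(sorted_data, b) - bisect_left(sorted_data, a)
```

Such exact computations need the whole sorted dataset, which is precisely
what is unavailable for unbounded streams.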

In this chapter, we start with a randomized sampling algorithm, then
continue with a simple tree-based q-digest and, finally, study the modern
t-digest algorithm that uses clustering for efficient estimation of
rank-based statistics in unbounded streams.


3 Wikipedia Page Views https://stats.wikimedia.org/EN/TablesPageViewsMonthlyCombined.htm

5.1 Random sampling

The random sampling technique, selecting a random subset of the data
without replacement, can be found in many algorithms in computer
science. For rank problems, this technique can be used to report
quantiles computed on samples as an approximation to the quantiles of
the whole data stream.

The distinct advantage is that such samples are much smaller; in fact,
rank and quantile queries on them can often be answered using classical
deterministic algorithms. However, to have some prior guarantees on
the error of such an approximation, the random sample has to be taken in
a special way, which could even be data dependent.

An additional problem that may occur with classical sampling is that
many sampling schemas require prior knowledge of the size of
the dataset, which is problematic for the continuous streams often used
in Big Data applications. One of the possible solutions is the simple
reservoir sampling technique, developed by Jeffrey Vitter in 1985, which
allows for the generation of a sample without such knowledge, but if
we wanted to apply it directly to the Quantile problem the memory
requirements would be quite significant.
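
To make the idea of sampling without knowing the stream size concrete, here
is a minimal sketch of the classical uniform reservoir sampling scheme. It
is not the MRL or Random algorithm discussed in this section; the function
name and the optional `rng` parameter are assumptions made for illustration.

```python
import random


def reservoir_sample(stream, k, rng=None):
    """Keep a uniform random sample of k items from a stream of unknown size."""
    rng = rng or random.Random()
    reservoir = []
    for seen, item in enumerate(stream, start=1):
        if len(reservoir) < k:
            reservoir.append(item)      # fill the reservoir first
        else:
            j = rng.randrange(seen)     # uniform integer in [0, seen)
            if j < k:                   # happens with probability k / seen
                reservoir[j] = item
    return reservoir
```

Every element of the stream ends up in the sample with the same probability,
so accurate quantile estimates require a large reservoir.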

The Random sampling algorithm, often referred to as MRL, was
published by Gurmeet Singh Manku, Sridhar Rajagopalan, and Bruce
Lindsay in 1999 [Ma99] and addressed the problem of correct
sampling and quantile estimation. It consists of a non-uniform sampling
technique and a deterministic quantile finding algorithm.

To support continuous data stream processing with little space
requirements, Manku et al. suggested a non-uniform modification of
reservoir sampling where elements that appear earlier in the sequence
are included with higher probability than others. Such a modification
has better space efficiency and is considerably more accurate than
the original reservoir sampling.

The main disadvantages of the MRL algorithm are that its configuration
parameters are determined by solving a complicated optimization problem
and it uses some complex procedures. In this section, we investigate
a simpler version of the MRL algorithm that was proposed by Ge Luo,
Lu Wang, Ke Yi, and Graham Cormode in 2013 [Wa13], [Lu16], and
denoted in the original articles as Random.

The Random algorithm processes data from the data stream in chunks 
of variable sizes and performs a sampling on them that produces the non- 
uniform sampling at the end. 

In order to store samples of the elements, the algorithm maintains
a data structure SampleBuffers that consists of $b$ simple data units
$B_1, B_2, \ldots, B_b$, called buffers, each of which stores at most $k$
elements and can be associated with some level $L$ at which it was populated.

The level parameter $L$ reflects the probability that the elements are
drawn and depends on the number of elements $n$ that have been processed
so far and the maximum allowed height $h$ of the tree that represents the
sequence of operations carried out by the algorithm:

$$L = L(n, h) = \max\left(0, \left\lceil \log \frac{n}{k \cdot 2^{h-1}} \right\rceil\right), \tag{5.1}$$

where $L(0, h) = 0$.
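
Formula (5.1) can be evaluated directly. The sketch below assumes that
the logarithm is taken to base 2, which agrees with the numbers used in
Example 5.3; the function name is hypothetical.

```python
import math


def active_level(n, k, h):
    """Active level L(n, h) from formula (5.1), with L(0, h) = 0."""
    if n == 0:
        return 0
    return max(0, math.ceil(math.log2(n / (k * 2 ** (h - 1)))))
```

For instance, with k = 4 and h = 3 the level stays at zero for the first 16
processed elements and becomes 1 once 20 elements have been processed.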

To populate an empty buffer $B_i$, $i \in 1 \ldots b$, at level $L$, we
choose $k$ random elements from $k \cdot 2^L$ consecutive input elements,
one per block of $2^L$ elements, and store them in $B_i$. At the end of
the procedure, the buffer might have fewer than $k$ elements because
the input sequence did not have enough elements, but if at least one
element is in the buffer, it is labeled as full.


The probability that a particular element from the incoming data stream
is selected and stored into a buffer directly depends on the level $L$ since
it controls the size of the chunk $2^L$ from which elements are drawn. This
is the practical implementation of the non-uniform sampling used in
the algorithm.

Algorithm 5.1: Populating empty buffers
Input: Data stream D
Input: Empty buffer B^L of size k at level L
Output: Populated buffer B^L and its label
for i ← 0 to k − 1 do
    S ← next(2^L, D)         // read next 2^L elements from D
    if S = ∅ then
        break
    x ← sample({s ∈ S})      // randomly choose one element from S
    B^L ← B^L ∪ {x}
label ← empty
if count(B^L) > 0 then
    label ← full
return B^L, label
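
A possible Python rendering of Algorithm 5.1 is sketched below. It assumes
that the stream is given as an iterator and that a buffer is a plain list;
the function name and the `rng` parameter are illustrative choices, not part
of the original algorithm.

```python
import random
from itertools import islice


def populate_buffer(stream_iter, k, level, rng=None):
    """Fill a buffer with at most k elements, one per block of 2**level inputs."""
    rng = rng or random.Random()
    block_size = 2 ** level
    buffer = []
    for _ in range(k):
        block = list(islice(stream_iter, block_size))  # next 2^L elements
        if not block:                                  # the stream is exhausted
            break
        buffer.append(rng.choice(block))               # one random element per block
    label = "full" if buffer else "empty"
    return buffer, label
```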


Two buffers from the same level $L$ can be collapsed, merged to reclaim
buffer space, which results in a new buffer of the same size at level $L + 1$.
To collapse two buffers, we sort the sequence of the elements from both,
and randomly select half of the elements, for example, by choosing all
the elements at either odd or even positions. The collapsed buffers are
marked as empty, and the output buffer as full.


Algorithm 5.2: Collapsing two non-empty buffers
Input: Non-empty buffers B_i^L, B_j^L of size k at level L
Output: Populated buffer B^(L+1) at level L + 1 and its label
S ← sort(B_i^L ∪ B_j^L)
free(B_i^L)
free(B_j^L)
B^(L+1) ← sample(S, k)      // randomly choose k elements from joined buffers
return B^(L+1), full


The collapse operation requires $O(k \cdot \log k)$ time for sorting the buffers,
and the subsequent buffer population can be performed in $O(k)$ time.
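
The collapse step can be sketched as follows, assuming buffers are plain
lists and that the random half is selected by keeping either all odd or all
even positions of the sorted sequence, as suggested above. The function name
is hypothetical.

```python
import random


def collapse(buffer_a, buffer_b, level, rng=None):
    """Merge two buffers of the same level into one buffer at level + 1."""
    rng = rng or random.Random()
    merged = sorted(buffer_a + buffer_b)
    # offset 0 keeps the 1st, 3rd, ... elements; offset 1 keeps the 2nd, 4th, ...
    offset = rng.randrange(2)
    return merged[offset::2], level + 1
```

The caller is expected to mark both input buffers as empty afterwards.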

Finally, the process of building the SampleBuffers data structure
consists of a series of buffer population steps and collapse operations.

We start with every buffer labeled as empty. Processing of the input
stream starts with setting the active level $L$ using formula (5.1); it
is equal to zero at the beginning since there are no processed elements
yet. If there is an empty buffer $B$, we populate it using Algorithm 5.1 by
reading $k \cdot 2^L$ elements from the stream. When all buffers become full,
we find the lowest level that contains at least two buffers and collapse
two of them, selected at random.
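
The whole construction described above can be put together in one compact
sketch. It is only an illustration under several assumptions: buffers are
stored as `(level, elements)` pairs with `None` for an empty slot,
the logarithm in (5.1) is base 2, the stream is peeked one element ahead so
that no collapse happens after the data has ended, and an error is raised if
no two buffers share a level. All names are hypothetical.

```python
import math
import random
from itertools import islice


def build_sample_buffers(stream, b, k, h, seed=None):
    """Return (buffers, n); each buffer is None or a (level, elements) pair."""
    rng = random.Random(seed)
    it = iter(stream)
    pending = []                     # one-element lookahead
    buffers = [None] * b
    n = 0

    def take(m):
        out = pending[:m]
        del pending[:m]
        out.extend(islice(it, m - len(out)))
        return out

    while True:
        pending.extend(take(1))      # peek at the next element
        if not pending:
            return buffers, n
        if None not in buffers:      # all buffers are full: collapse two
            by_level = {}
            for idx, (lvl, _) in enumerate(buffers):
                by_level.setdefault(lvl, []).append(idx)
            shared = [lvl for lvl, idxs in by_level.items() if len(idxs) >= 2]
            if not shared:
                raise RuntimeError("no two buffers share a level")
            lvl = min(shared)
            i, j = rng.sample(by_level[lvl], 2)
            merged = sorted(buffers[i][1] + buffers[j][1])
            buffers[i] = None
            buffers[j] = (lvl + 1, merged[rng.randrange(2)::2])
        # active level by formula (5.1)
        level = 0 if n == 0 else max(0, math.ceil(math.log2(n / (k * 2 ** (h - 1)))))
        sample = []
        for _ in range(k):           # Algorithm 5.1
            block = take(2 ** level)
            if not block:
                break
            n += len(block)
            sample.append(rng.choice(block))
        buffers[buffers.index(None)] = (level, sample)
```

Because the selections are random, two runs over the same stream generally
produce different buffer contents, although the sequence of levels is
the same.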

The total number of collapse operations is $O(n/k)$ throughout the entire
data stream, which is about $O(1)$ for each update. The sorting takes
$O(\log k)$ for each update. Thus, the amortized time is $O(\log k)$.


Example 5.3: Build Sample Buffers 
Consider a dataset of 25 integers: 

{ 0 , 0 , 3 , 4 , 1 , 6 , 0 , 5 , 2 , 0 , 3 , 3 , 2 , 3 , 0 , 2 , 5 , 0 , 3 , 1 , 0 , 3 , 1 , 6 , 1 }. 

To illustrate the process of handling a data stream, we use the height
$h = 3$ and maintain $b = 4$ buffers $B_1, B_2, B_3, B_4$ of $k = 4$ elements
each. Thus, simplifying formula (5.1), the active level can be calculated as

$$L = L(n) = \max(0, \lceil \log(n) - 4 \rceil).$$


In the beginning, the number of processed elements $n = 0$, hence we start
from $L = 0$, read the first $N_1 = 4$ elements $\{0, 0, 3, 4\}$ from
the input stream, and populate an empty buffer, say $B_1$. Since the capacity
of the buffer is also 4, we do not need to draw random elements and all of
them are stored.


Buffer    Level    Elements
B_1       0        0, 0, 3, 4
B_2       -        (empty)
B_3       -        (empty)
B_4       -        (empty)


Secondly, we again need to define the active level. The number of processed
elements is $n = N_1 = 4$ and the active level remains zero: $L = L(4) =
\max(0, 2 - 4) = 0$. We read the next $N_2 = 4$ elements $\{1, 6, 0, 5\}$ and,
in the same way, populate buffer $B_2$.

Buffer    Level    Elements
B_1       0        0, 0, 3, 4
B_2       0        1, 6, 0, 5
B_3       -        (empty)
B_4       -        (empty)


Thus, we have already processed $n = N_1 + N_2 = 8$ elements, the current
level is $L = L(8) = \max(0, 3 - 4) = 0$, and we store the next $N_3 = 4$
elements $\{2, 0, 3, 3\}$ in buffer $B_3$.


Buffer    Level    Elements
B_1       0        0, 0, 3, 4
B_2       0        1, 6, 0, 5
B_3       0        2, 0, 3, 3
B_4       -        (empty)


Likewise, after processing $n = N_1 + N_2 + N_3 = 12$ elements, the active
level is still zero, so we again read the next $N_4 = 4$ elements
$\{2, 3, 0, 2\}$ and populate the only remaining empty buffer $B_4$.


Buffer    Level    Elements
B_1       0        0, 0, 3, 4
B_2       0        1, 6, 0, 5
B_3       0        2, 0, 3, 3
B_4       0        2, 3, 0, 2



At this point we have no empty buffers left, hence we need to perform
the collapse operation. The lowest level that has at least two buffers is
level 0, from which we randomly select two buffers, for instance, $B_2^0$
and $B_3^0$. First, we merge all elements from these buffers and sort them:

$$\{1, 6, 0, 5\} \cup \{2, 0, 3, 3\} = \{1, 6, 0, 5, 2, 0, 3, 3\} \to \{0, 0, 1, 2, 3, 3, 5, 6\}.$$

Next, we free buffers $B_2^0$ and $B_3^0$, and populate buffer $B_3$ at
level 1 with 50% of their former elements; for simplicity, let us take
the elements in odd positions.


Buffer    Level    Elements
B_1       0        0, 0, 3, 4
B_2       -        (empty)
B_3       1        0, 1, 3, 5
B_4       0        2, 3, 0, 2


Thus, we have already processed $n = N_1 + N_2 + N_3 + N_4 = 16$ elements, but
the active level remains zero, and we populate $B_2$ with the next $N_5 = 4$
elements from the data stream: $\{5, 0, 3, 1\}$.


Buffer    Level    Elements
B_1       0        0, 0, 3, 4
B_2       0        5, 0, 3, 1
B_3       1        0, 1, 3, 5
B_4       0        2, 3, 0, 2



Once again there are no empty buffers, thus we need to perform another
collapse. Level 0 contains three full buffers and we randomly choose two of
them, e.g., $B_1^0$ and $B_4^0$, then merge and sort all their elements:

$$\{0, 0, 3, 4\} \cup \{2, 3, 0, 2\} = \{0, 0, 3, 4, 2, 3, 0, 2\} \to \{0, 0, 0, 2, 2, 3, 3, 4\}.$$

We label buffers $B_1^0$ and $B_4^0$ as empty and populate buffer $B_4$ at
level 1 with 50% of their elements by taking the elements in even positions.


Buffer    Level    Elements
B_1       -        (empty)
B_2       0        5, 0, 3, 1
B_3       1        0, 1, 3, 5
B_4       1        0, 2, 3, 4


At the next step, we have already processed $n = N_1 + N_2 + N_3 + N_4 + N_5 =
20$ elements, thus the active level is $L = L(20) = \lceil 4.32 - 4 \rceil = 1$,
and we need to read the next $N_6 = 4 \cdot 2^1 = 8$ elements from the data
stream. In this case, there are not enough elements left in the data stream,
so we read $\{0, 3, 1, 6, 1\}$ and populate $B_1$ by sampling one element from
each group of two elements.


Buffer    Level    Elements
B_1       1        3, 1, 1
B_2       0        5, 0, 3, 1
B_3       1        0, 1, 3, 5
B_4       1        0, 2, 3, 4



Finally, we have built the resulting data structure SampleBuffers. 

With SampleBuffers it is possible to answer the Inverse quantile
query: the rank of the given element $x$ can be estimated as the sum,
weighted by level, of the counts of elements smaller than $x$ in each
non-empty buffer:

$$rank(x) = \sum_{i=1}^{b} 2^{L(B_i)} \cdot |\{e < x \mid e \in B_i\}|. \tag{5.2}$$


Example 5.4: Inverse quantile query with Random sampling 
Consider the data stream from Example 5.3 and perform the Inverse 
quantile query to estimate the rank of element 4. 

The data structure SampleBuffers has the following form. 


Buffer    Level    Elements
B_1       1        3, 1, 1
B_2       0        5, 0, 3, 1
B_3       1        0, 1, 3, 5
B_4       1        0, 2, 3, 4

Using the formula (5.2) we calculate the rank:

$$rank(4) = 2^1 \cdot 3 + 2^0 \cdot 3 + 2^1 \cdot 3 + 2^1 \cdot 3 = 21.$$

Thus, for element 4 the estimated rank(4) = 21. 
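
The computation by formula (5.2) can be checked with a short sketch.
The representation of SampleBuffers as a list of `(level, elements)` pairs
and the function name are assumptions made for this illustration; the buffer
contents are those obtained at the end of Example 5.3.

```python
def estimate_rank(buffers, x):
    """Estimated rank of x by formula (5.2): level-weighted counts of smaller elements."""
    return sum(
        2 ** level * sum(1 for e in elements if e < x)
        for level, elements in buffers
    )


# SampleBuffers from Example 5.3 as (level, elements) pairs
sample_buffers = [
    (1, [3, 1, 1]),
    (0, [5, 0, 3, 1]),
    (1, [0, 1, 3, 5]),
    (1, [0, 2, 3, 4]),
]

print(estimate_rank(sample_buffers, 4))  # 21
```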

To answer the Quantile query and find the quantile from
the SampleBuffers data structure, we simply need to search for
an element whose estimated rank, derived from formula (5.2), is closest
to $q \cdot n$. In principle, we need to ask an Inverse quantile query for
each of the elements in the data structure, but we can use binary
search to speed up the process and stop as soon as we find a value that
is close enough.


Example 5.5: Quantile query with Random sampling 

Consider the data stream from Example 5.3 and calculate the 0.65-quantile. 

The total number of elements processed into the data structure
SampleBuffers is $n = 25$, so our boundary value is
$q \cdot n = 0.65 \cdot 25 = 16.25$.


Buffer    Level    Elements
B_1       1        3, 1, 1
B_2       0        5, 0, 3, 1
B_3       1        0, 1, 3, 5
B_4       1        0, 2, 3, 4

The SampleBuffers contain the elements {0, 1, 2, 3, 4, 5}. We start with
element 0 and estimate its rank which, according to formula (5.2), equals
zero: rank(0) = 0. Next, we check element 1, whose rank estimate is
rank(1) = 5. The rank of element 2 is rank(2) = 12, while rank(3) = 14.
We already know from Example 5.4 that rank(4) = 21. Finally, the rank
of element 5 is rank(5) = 23.

Thus, the closest element to the boundary value 16.25 is element 3 with 
rank(3) = 14. We report element 3 as an approximation of the 0.65- 
quantile. 

Note that we could speed up the process by using a binary search over
the sorted elements, taking into account that the rank is a monotonic
function of its argument.
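
The search from Example 5.5 can be sketched as a simple linear scan over
the distinct stored elements. The `(level, elements)` representation and
the function names are assumptions; a binary search over the sorted
candidates would be the natural optimization.

```python
def estimate_rank(buffers, x):
    """Estimated rank of x by formula (5.2)."""
    return sum(
        2 ** level * sum(1 for e in elements if e < x)
        for level, elements in buffers
    )


def estimate_quantile(buffers, q, n):
    """Stored element whose estimated rank is closest to q * n."""
    target = q * n
    candidates = sorted({e for _, elements in buffers for e in elements})
    return min(candidates, key=lambda x: abs(estimate_rank(buffers, x) - target))


sample_buffers = [
    (1, [3, 1, 1]),
    (0, [5, 0, 3, 1]),
    (1, [0, 1, 3, 5]),
    (1, [0, 2, 3, 4]),
]

print(estimate_quantile(sample_buffers, 0.65, 25))  # 3
```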

Properties

To compute the $\varepsilon$-approximate $q$-quantile, the Random algorithm
requires a fixed amount of memory that is proportional to $b \cdot k$ and
depends only on $\varepsilon$. With a given approximation error, we can have
a computation tree of height $h = \log \frac{1}{\varepsilon}$, and the optimal
number of buffers is

$$b = \log \frac{1}{\varepsilon} + 1,$$
while the size of each buffer is

$$k = \frac{1}{\varepsilon} \cdot \sqrt{\log \frac{1}{\varepsilon}}.$$

Being probabilistic, the Random algorithm correctly reports quantile
approximations with a constant, bounded error probability that
originates from the random sampling and random merging steps.


5.2 q-digest 

Quantile digest, or q-digest, is a tree-based stream summary algorithm
that was proposed by Nisheeth Shrivastava, Chiranjeeb Buragohain,
Divyakant Agrawal, and Subhash Suri in 2004 [Sh04] for use in the context
of monitoring distributed data from sensors.

The q-digest addresses the quantile computation problem as
a histogram problem where data are summarized by some number of
buckets. The algorithm maintains a set of such buckets in a tree-like
Q-digest data structure, merges small buckets, and splits the big ones.
It is a lossy deterministic algorithm; however, we consider it very useful
and important for our narration.

The algorithm works with integer values within some known range.
The binary partition of the integer range $[0, N - 1]$ can be represented as
a virtual full and complete binary tree whose root corresponds to
the whole range $[0, N - 1]$, whose left and right children have the ranges
$[0, \lfloor (N - 1)/2 \rfloor]$ and $[\lfloor (N - 1)/2 \rfloor + 1, N - 1]$,
and so on recursively until the leaf nodes, which represent single
integer values. The depth of the tree is $\log N$.


Every node $v$ in such a binary tree is a bucket that has an associated
range $[v_{min}, v_{max}]$. Additionally, we associate a counter $v_{count}$
with each bucket to represent the number of elements (including duplicates)
that are indexed in it.


Example 5.6: Binary partitioning for q-digest 

Consider the first $n = 20$ integers, all from the range $[0, 7]$, of
the dataset that we investigated in Example 5.3:

{ 0 , 0 , 3 , 4 , 1 , 6 , 0 , 5 , 2 , 0 , 3 , 3 , 2 , 3 , 0 , 2 , 5 , 0 , 3 , 1 }. 


By binary partitioning the range we build the following binary tree and
bucket the input data:

                         [0, 7]
            [0, 3]                     [4, 7]
      [0, 1]      [2, 3]         [4, 5]      [6, 7]
      0_6  1_2    2_3  3_5       4_1  5_2    6_1  7_0

The leaf nodes from left to right represent elements from $[0, N - 1]$ and
the subscripts indicate the frequencies of the elements in the dataset.


Thus, the internal representation of the data consists of
the frequencies with which the stored elements were observed. In the worst
case, storing such data requires $O(n)$ or $O(N)$ space, whichever is
smaller. Note that in practice such binary trees are likely to be very
sparse and imbalanced, therefore storing them in raw form without
compression is quite inefficient.

The q-digest algorithm proposes a way to compress and compactly
store such a binary partition tree. Its data structure Q-digest encodes
information about the distribution of elements and represents a version of
the binary tree that includes only those buckets $v$ that satisfy the following
digest property:

$$\begin{cases} v_{count} \le \lfloor n/\sigma \rfloor, & \text{(except leaf buckets)} \\ v_{count} + v^{p}_{count} + v^{s}_{count} > \lfloor n/\sigma \rfloor, & \text{(except the root)} \end{cases} \tag{5.3}$$

where $v^p$ is the parent and $v^s$ is the sibling of $v$; $n$ is the total
number of elements, and $\sigma \in [1, n]$ is a design parameter responsible
for the level of compression.

The exceptions from this property are the root and leaf buckets. The root
bucket can violate the digest property (5.3) and still be included in
the Q-digest data structure. The leaf buckets with counts bigger than
the boundary value $\lfloor n/\sigma \rfloor$ (frequent elements) are included
as well.

In fact, the digest property defines a compromise between including 
a few top-level and broad buckets, and many small buckets that contain 
information about a few non-frequent elements. 

Simplifying, the first constraint in the digest property (5.3) excludes
buckets with high counts, unless they are leaf nodes that contain counts
of high-frequency elements, because for such buckets it is worth storing
child elements and having more precise counters.

On the other hand, according to the second constraint, if two adjacent
buckets, which are siblings, have low counts, then we want to avoid
having two separate counters for each of them and it is better to merge
them into their parent and achieve the required degree of compression.

Thus, the construction of the Q-digest requires hierarchical merging
and reduction of the buckets, going through all buckets bottom-up and
checking if any of them violate the digest property. In practice, since we
are only going bottom-up, only the second constraint needs to be checked.

Except for the root bucket, for every bucket $v$ that violates the digest
property, we merge its subtree by combining the counts from it, its parent
$v^p$, and its sibling $v^s$, and promote them to the parent bucket:

$$v^{p}_{count} \leftarrow v_{count} + v^{p}_{count} + v^{s}_{count},$$

while excluding the bucket $v$ and its sibling $v^s$ from the Q-digest.

Algorithm 5.3: Compressing q-digest
Input: q-digest data structure of a range [0, N − 1]
Input: Compression factor σ
Output: Compressed q-digest data structure
level ← log N − 1
while level > 0 do
    for v ∈ Q-digest[level] do
        if v_count + v^s_count + v^p_count ≤ ⌊n/σ⌋ then
            v^p_count ← v_count + v^s_count + v^p_count
            Q-digest ← Q-digest \ {v, v^s}
    level ← level − 1
return Q-digest


The compression takes $O(m \cdot \log N)$ time, where $m = |\text{Q-digest}|$
is the number of buckets in the data structure; thus, the theoretical update
cost per element is about $O(\log N)$. In practice, however, the update
takes more time because every element is inserted into the leaf node first, 
and then, during the compress operation, the algorithm needs to find its 
appropriate position in the Q-digest by moving the element up one step 
at a time. 

Example 5.7: Compress tree with q-digest 

Consider the dataset of n = 20 elements from Example 5.6 where 
the frequencies for non-observed buckets default to zero. 

                           [0, 7]_0
             [0, 3]_0                      [4, 7]_0
      [0, 1]_0      [2, 3]_0        [4, 5]_0      [6, 7]_0
      0_6  1_2      2_3  3_5        4_1  5_2      6_1  7_0



Let us assume we want to achieve a compression with $\sigma = 5$, then
the boundary value is

$$\left\lfloor \frac{n}{\sigma} \right\rfloor = \left\lfloor \frac{20}{5} \right\rfloor = 4.$$


Going bottom-up, consider the fourth level first, where only the leaf buckets
from 0 to 3 satisfy the second condition of the digest property (5.3).
According to Algorithm 5.3, the children of buckets [4, 5] and [6, 7], which
together violate the digest property, have to be merged into their parents
and excluded from the Q-digest.

Thus, the buckets with non-zero counts in the Q-digest at this stage are:

    [4, 5]_3    [6, 7]_1    0_6    1_2    2_3    3_5

Further, at the third level, all buckets violate the constraints (5.3).
Therefore, we also compress them into their parents and do not include them
in the Q-digest:

    [4, 7]_4    0_6    1_2    2_3    3_5


At the second level, we check the digest property for two children of 
the root bucket, which again violate the constraints (5.3) since their total 
counts do not exceed the boundary value and, consequently, they have to 
be merged to the parent. 

For the root element, it is not necessary to check the digest property since
it, as we declared earlier, is always included in the compressed Q-digest
if it has non-zero associated counts.

Hence, the final version of the compressed Q-digest data structure is as
follows:

                 [0, 7]_4

      0_6    1_2    2_3    3_5

As we can see in this example, the compressed Q-digest data structure
requires storing only five buckets with non-zero counts.


Because we always go bottom-up (and never top-down), check the digest 
property, and make the decision about merging buckets only once during 
the procedure, it is not necessary that all buckets from the compressed 
q-digest satisfy the digest property after compression. For instance, changes 
(e.g., merging to a parent) in some buckets on the top levels of the tree 
could make already included buckets violate the constraints of the digest 
property (5.3). However, in practice, this behavior does not decrease 
the accuracy of the algorithm and, at worst, produces a less optimal data 
structure that consumes more memory than is theoretically expected. 

Putting that all together, we can formulate the complete q-digest
algorithm for an arbitrary dataset as below.


Algorithm 5.4: q-digest algorithm
Input: Dataset D with elements from range [0, N − 1]
Input: Compression factor σ
Output: Compressed q-digest data structure
Q-digest ← BinaryPartitionTree(D, [0, N − 1])
return Compress(Q-digest, N, σ)

To optimize the representation of the Q-digest data structure,
the buckets in the associated binary tree can be enumerated in
a left-to-right, top-to-bottom manner:

Figure 5.1: Buckets enumeration

              1
        2           3
      4   5       6   7


As soon as all buckets $v$ are enumerated, it is easy to restore
the corresponding range $[v_{min}, v_{max}]$ even if we only know its index $i$.

Algorithm 5.5: Restoring bucket range [v_min, v_max]
Input: Bucket index i
Output: Bucket range
level ← ⌊log(i)⌋
n ← 2^level              // number of buckets on the level
m ← i mod n              // position of the bucket on the level
return [N/n · m, N/n · (m + 1) − 1]
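
Algorithm 5.5 translates almost directly into code. The sketch below
assumes that N is a power of two, that the root has index 1, and that
the returned range is inclusive on both ends; the function name is
hypothetical.

```python
def bucket_range(i, N):
    """Inclusive value range of the bucket with index i over [0, N - 1]."""
    level = i.bit_length() - 1        # floor(log2(i))
    per_level = 2 ** level            # number of buckets on the level
    position = i % per_level          # position of the bucket on the level
    width = N // per_level            # number of values covered by one bucket
    return position * width, (position + 1) * width - 1
```

For the tree of Example 5.7 (N = 8), index 1 is the root and indices 8 to 15
are the leaves for the values 0 to 7.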

In this way, we can build a linear representation of the Q-digest data 
structure — an array of buckets, where each bucket is just a 2-tuple 
of its number and the associated counts. For example, the compressed 
Q-digest from Example 5.7 has the following linear representation: 
((1,4),(8,6),(9,2),(10,3),(11,5)). 
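
With this enumeration, the construction and compression of the Q-digest can
be sketched in a few lines. The code below is an illustration rather than
a reference implementation: it assumes that N is a power of two, so that
the leaf for value x has index N + x and the parent of bucket v has index
v // 2, and it starts the bottom-up pass from the leaf level, as done in
Example 5.7. The function name is hypothetical.

```python
from collections import Counter


def build_q_digest(data, N, sigma):
    """Linear representation [(index, count), ...] of a compressed q-digest."""
    threshold = len(data) // sigma            # boundary value floor(n / sigma)
    digest = Counter()
    for x in data:
        digest[N + x] += 1                    # leaf bucket of value x
    level_start = N                           # first index on the current level
    while level_start > 1:
        # visit every pair of siblings (v, v + 1) on the level
        for v in range(level_start, 2 * level_start, 2):
            sibling, parent = v + 1, v // 2
            total = digest[v] + digest[sibling] + digest[parent]
            if total <= threshold:            # second constraint of (5.3) violated
                digest[parent] = total
                del digest[v], digest[sibling]
        level_start //= 2
    return sorted((i, c) for i, c in digest.items() if c > 0)


data = [0, 0, 3, 4, 1, 6, 0, 5, 2, 0, 3, 3, 2, 3, 0, 2, 5, 0, 3, 1]
print(build_q_digest(data, N=8, sigma=5))
# [(1, 4), (8, 6), (9, 2), (10, 3), (11, 5)]
```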

Two q-digests with the same compression factor $\sigma$ and element ranges
can be easily merged, which allows for the processing of Big Data streams
in a distributed fashion. The idea is to take the union of their sets of
stored buckets and add the counts of the buckets with the same range,
sum the total number of elements, and afterward run the compression
algorithm.
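
The union step of such a merge can be sketched as follows, assuming both
digests use the linear representation of `(index, count)` pairs over
the same range. The subsequent compression pass is not shown, and
the function name is hypothetical.

```python
from collections import Counter


def merge_q_digests(digest_a, n_a, digest_b, n_b):
    """Union of two linear q-digest representations, before compression."""
    merged = Counter(dict(digest_a))
    merged.update(dict(digest_b))     # adds the counts of equal bucket indices
    return sorted(merged.items()), n_a + n_b
```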

The q-digest algorithm can be used to answer the Quantile query and
find the $q$-quantile from the Q-digest data structure. At first, we obtain
a sorted sequence $S$ by ordering the buckets in increasing order of their
$v_{max}$ values, breaking ties by placing smaller ranges first. After that,
we can scan the sequence $S$ from the beginning and add the counts of buckets
as they are seen. As soon as for some bucket $v^*$ this sum, that is the rank
estimation for the bucket, becomes larger than $q \cdot n$, its $v^*_{max}$
is reported as the estimate of the $q$-quantile.


Algorithm 5.6: Answering Quantile queries with q-digest
Input: q-digest data structure
Input: Value q ∈ [0, 1]
Output: q-quantile
S ← sort(Q-digest)
rank ← 0
for (v, count) ∈ S do
    rank ← rank + count
    if rank > q · n then
        return v_max


There are at least $q \cdot n$ elements whose values do not exceed
$v^*_{max}$; therefore, the rank of bucket $v^*$ is at least $q \cdot n$.


It is possible to have an error in the calculation of the
$\varepsilon$-approximate $q$-quantile if values less than $v^*_{max}$ are
present in the ancestors of bucket $v^*$ because in this case they will not
be counted by Algorithm 5.6. Analytically, such an error is bounded by
$\varepsilon \cdot n$ and the algorithm reports a rank in the interval
$[q \cdot n, (q + \varepsilon) \cdot n]$; thus, it never underestimates
the exact value of the $q$-quantile.

Example 5.8: Quantile query with q-digest

We perform the Quantile query to calculate the 0.65-quantile from
the Q-digest data structure built in Example 5.7 whose linear
representation has the following form:

((1,4),(8,6),(9,2),(10,3),(11,5)).

Thus, the sorted sequence of buckets is

S = ((8,6), (9,2), (10,3), (11,5), (1,4)).

According to the algorithm, going from the beginning, we sum the counts
of the buckets until the total becomes larger than $0.65 \cdot n = 13$. In
the current Q-digest, we exceed that boundary value at the bucket (11,5)
that corresponds to leaf element 3.

Thus, the q-digest estimation of the 0.65-quantile (or 65th percentile) for
the dataset of Example 5.7 is the element 3.
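
The query of Example 5.8 can be reproduced with the following sketch of
Algorithm 5.6. It assumes the linear representation of the Q-digest, a range
size N that is a power of two, and ties between equal maximum values broken
in favor of narrower ranges; if the running sum never exceeds the boundary,
the largest maximum value seen is returned. The function names are
hypothetical.

```python
def bucket_range(i, N):
    """Inclusive value range of the bucket with index i over [0, N - 1]."""
    level = i.bit_length() - 1
    per_level = 2 ** level
    width = N // per_level
    position = i % per_level
    return position * width, (position + 1) * width - 1


def q_digest_quantile(digest, q, n, N):
    """Estimate the q-quantile from a list of (index, count) buckets."""
    def sort_key(bucket):
        low, high = bucket_range(bucket[0], N)
        return high, high - low       # by v_max, narrower ranges first

    rank, v_max = 0, None
    for index, count in sorted(digest, key=sort_key):
        rank += count
        v_max = bucket_range(index, N)[1]
        if rank > q * n:
            break
    return v_max


digest = [(1, 4), (8, 6), (9, 2), (10, 3), (11, 5)]
print(q_digest_quantile(digest, 0.65, n=20, N=8))  # 3
```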


The Inverse quantile query can be addressed in a similar manner. We
build a sorted sequence $S$ of the buckets and traverse it from the beginning
while maintaining the running sum of counts from seen buckets. The rank
of the given element $x$ can be estimated as the sum of the counts of
the buckets $v$ for which $x > v_{max}$.


Algorithm 5.7: Answering Inverse quantile queries with q-digest
Input: Element x
Input: q-digest data structure
Output: Rank of element
S ← sort(Q-digest)
rank ← 0
for (v, count) ∈ S do
    if x > v_max then
        rank ← rank + count
return rank


As in the Quantile query above, the rank obtained here lies within
the interval $[rank(x), rank(x) + \varepsilon \cdot n]$.

As we already mentioned, to answer the Range query it is enough to
perform two Inverse quantile queries to find the ranks of the range
borders $a$ and $b$ and to take their difference. The maximum error for
the Range query in q-digest can be estimated as $2 \varepsilon \cdot n$.


Algorithm 5.8: Answering Range queries with q-digest
Input: Range [a, b]
Input: q-digest data structure
Output: Number of elements in range
r_a ← InverseQuantileQuery(a, Q-digest)
r_b ← InverseQuantileQuery(b, Q-digest)
return r_b − r_a
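
Algorithms 5.7 and 5.8 can be sketched together. The code assumes the linear
representation of the Q-digest and a range size N that is a power of two.
Since the rank is a plain sum over qualifying buckets, the sketch skips
the sorting step. The function names are hypothetical.

```python
def bucket_max(i, N):
    """Largest value covered by the bucket with index i over [0, N - 1]."""
    level = i.bit_length() - 1
    per_level = 2 ** level
    return (i % per_level + 1) * (N // per_level) - 1


def q_digest_rank(digest, x, N):
    """Inverse quantile query: sum of counts of buckets with v_max < x."""
    return sum(count for index, count in digest if x > bucket_max(index, N))


def q_digest_range_count(digest, a, b, N):
    """Range query: difference of the estimated ranks of the borders."""
    return q_digest_rank(digest, b, N) - q_digest_rank(digest, a, N)


digest = [(1, 4), (8, 6), (9, 2), (10, 3), (11, 5)]
print(q_digest_rank(digest, 3, N=8))            # 11
print(q_digest_range_count(digest, 2, 4, N=8))  # 8
```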


Properties 

The q-digest algorithm is a lossy algorithm; it compresses information 
about low-frequency elements while carefully preserving information 
about high-frequency ones. Therefore, it provides a good approximation 
schema when there are wide variations in frequencies of different elements. 
The q-digest algorithm can provide information about the distribution of 
elements values, but not the information concerning where those values 
have occurred. 

There is a clear trade-off between the accuracy of the algorithm and
the memory required to store the Q-digest data structure, which is
controlled by the compression factor $\sigma$. Thus, for the given range
$[0, N - 1]$, we can expect at most $3 \cdot \sigma$ stored buckets, and
the error in the $\varepsilon$-approximate $q$-quantile computation is upper
bounded:

$$\varepsilon \le \frac{\log N}{\sigma}.$$
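
As a small worked example of this trade-off, the sketch below picks
the smallest compression factor that satisfies the bound for a desired error
and reports the corresponding maximum number of buckets. It assumes
the logarithm is base 2; the function name is hypothetical.

```python
import math


def compression_for_error(eps, N):
    """Smallest integer sigma with log2(N) / sigma <= eps, and the bucket bound 3 * sigma."""
    sigma = math.ceil(math.log2(N) / eps)
    return sigma, 3 * sigma


print(compression_for_error(0.25, 256))  # (32, 96)
```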


The core property of the q-digest is that it is adaptive to the data and
builds buckets of almost equal weights. In contrast to the traditional
histogram, q-digest allows overlapping buckets, which makes it possible
to answer consensus queries (e.g., the frequent values).

The main problems in practical applications of the q-digest algorithm
are that it can handle only integer elements, requires their range to be
known in advance, and suffers from significant errors for extreme quantiles.


5.3 t-digest 

One of the modern alternatives for the accurate online accumulation of
rank-based statistics is called t-digest and was proposed by Ted Dunning
and Otmar Ertl in 2014 [Du14]. The t-digest algorithm allows estimating
quantiles in unbounded streams with a focus on extreme values such as
the 0.99-quantile. This is ongoing research and the algorithm periodically
gets improvement updates based on its practical applications [Du18].

The t-digest summarizes the input data stream $D$ in varying-sized
clusters, which allows it to maintain good accuracy in quantile
computation while processing a large amount of data. Every such cluster
$C_i$ represents a subset of input elements and is sized to ensure it is not
too large to be able to estimate quantiles by interpolation, but not too
small to prevent ending up with too many clusters.

Every cluster $C_i$ is defined by the centroid $c_i$, a data point at
the center of the cluster, that is the mean of the input elements that
contribute to this cluster, and the number of such elements $c_i^{count}$.
The t-digest data structure is an array of such weighted centroids
$\{(c_1, c_1^{count}), (c_2, c_2^{count}), \ldots, (c_m, c_m^{count})\}$ that
are sorted in ascending order. From this sorted sequence we can estimate
the maximal quantile value that corresponds to each centroid $c_i$:

$$q(c_i) = \frac{1}{n} \sum_{j<i} c_j^{count} + \frac{1}{n} c_i^{count}, \tag{5.4}$$

where $n = \sum_{i=1}^{m} c_i^{count}$ is the total number of indexed elements
in the data structure.
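
Formula (5.4) is simply a normalized running sum of the cluster counts.
A minimal sketch, assuming centroids are given as `(mean, count)` pairs
sorted by mean and using made-up values for illustration:

```python
from itertools import accumulate


def centroid_quantiles(centroids):
    """Maximal quantile value q(c_i) of every centroid, by formula (5.4)."""
    counts = [count for _, count in centroids]
    n = sum(counts)
    return [running / n for running in accumulate(counts)]


# three hypothetical clusters with 2, 6, and 2 elements
print(centroid_quantiles([(1.0, 2), (5.0, 6), (9.0, 2)]))  # [0.2, 0.8, 1.0]
```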

Thus, according to (5.4), every cluster in the t-digest data structure
is responsible for a certain range of quantile values $(q(c_{i-1}), q(c_i)]$,
whose length depends on the cluster size, the number of elements that
contribute to this cluster. Correct cluster sizing has a direct influence on
the accuracy, and in the t-digest algorithm it is provided by
a non-decreasing scale function. Such a function $k = k(q, \sigma)$ takes into
account the desired compression $\sigma$ and scales quantile values $q$
differently based on how far they are from the extrema $q = 0$ and $q = 1$.
A good choice of the scale function is crucial, and there are alternative
functions with different trade-offs in terms of accuracy [Du18a]. For
instance, one of the commonly used functions is

$$k(q, \sigma) = \frac{\sigma}{2\pi} \arcsin(2q - 1), \tag{5.5}$$

where the compression parameter $\sigma > 1$ (bigger values correspond to less
compression).
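
The scale function (5.5) and its inverse are easy to evaluate numerically.
The following sketch uses hypothetical function names; the inverse is
obtained by solving (5.5) for $q$.

```python
import math


def scale(q, sigma):
    """Scale function k(q, sigma) from formula (5.5)."""
    return sigma / (2 * math.pi) * math.asin(2 * q - 1)


def scale_inverse(k, sigma):
    """Quantile q such that scale(q, sigma) == k."""
    return (math.sin(2 * math.pi * k / sigma) + 1) / 2


# the function is steeper near the extrema than in the middle
print(scale(0.99, 100) - scale(0.98, 100) > scale(0.51, 100) - scale(0.50, 100))  # True
```

The steeper slope near q = 0 and q = 1 is what forces clusters close to
the extrema to be small.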

With respect to the chosen scale function $k = k(q, \sigma)$, for every cluster
$C_i$ that is associated with its centroid $c_i$ in the t-digest data
structure, we can define the $k$-size, denoted as $K(c_i)$, which expresses
the scaled length of the quantile range for the cluster:

$$K(c_i) := k(q(c_i), \sigma) - k(q(c_{i-1}), \sigma), \quad i = 2 \ldots m, \tag{5.6}$$

where $K(c_1) := k(q(c_1), \sigma)$.

To restrict the number of elements in a cluster in a way that depends
on the quantile values it is responsible for, we can restrict its k-size.
With non-linear scale functions this results in non-uniform clusters,
with larger cluster sizes for the middle-range quantiles and smaller ones
near the extrema (down to singleton clusters that contain only one
element). Moreover, the t-digest algorithm is designed to build
a fully-merged t-digest data structure, meaning all clusters
$\{C_i\}_{i=1}^{m}$ satisfy the digest property:

$$K(c_i) \le 1 \ \text{(except singleton clusters)}, \qquad K(c_i) + K(c_{i+1}) > 1, \tag{5.7}$$

which not only restricts the k-size of each cluster, but also ensures that
any two adjacent clusters cannot be further merged.

In practice, to fulfill the constraints of the digest property (5.7), we
do not need to recompute the k-size for each cluster on every change.
Instead, since all centroids are sorted in the t-digest, the number of
elements in the cluster $C_i$ can be limited by choosing a boundary for its
estimated maximal quantile value:

$$q_{limit} = k^{-1}(k(q(c_i), \sigma) + 1, \sigma), \tag{5.8}$$

that for the scale function (5.5) has the following form:

$$q_{limit} = \frac{1}{2}\left[1 + \sin\left(\arcsin(2 \cdot q(c_i) - 1) + \frac{2\pi}{\sigma}\right)\right]. \tag{5.9}$$
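The boundary computation can be sketched in a few lines of Python. This is
only an illustration under stated assumptions: the function names
`k_scale`, `k_inverse` and `q_limit` are ours, and the argument `q` stands
for the quantile value plugged into (5.8). The two calls use the values
$q = 0$ and $q = 0.3$ with $\sigma = 5$, which are the ones that appear for
the first two candidate clusters in Example 5.9.

```python
import math


def k_scale(q: float, sigma: float) -> float:
    """Scale function (5.5): sigma / (2*pi) * arcsin(2q - 1)."""
    return sigma / (2 * math.pi) * math.asin(2 * q - 1)


def k_inverse(k: float, sigma: float) -> float:
    """Inverse of (5.5) with respect to q."""
    return 0.5 * (1 + math.sin(2 * math.pi * k / sigma))


def q_limit(q: float, sigma: float) -> float:
    """Boundary (5.8); with the scale function above it equals (5.9)."""
    return k_inverse(k_scale(q, sigma) + 1, sigma)


print(round(q_limit(0.0, 5), 5))   # 0.34549
print(q_limit(0.3, 5))             # approximately 0.874025
```

Composing `k_scale` and `k_inverse` keeps the sketch close to (5.8), so
another scale function could be swapped in by replacing just these two
helpers.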


Having the rule (5.8) to restrict the number of elements per cluster in
t-digest, we can formulate the merging t-digest algorithm as
Algorithm 5.9, which is similar to the regular clustering procedure. To
summarize an input sequence of weighted data points
$\{(x_1, x_1^{count}), (x_2, x_2^{count}), \dots, (x_b, x_b^{count})\}$, we
sort them together with all centroids from the t-digest data structure
and, making a single pass through the resulting sequence $X$, attempt to
merge them successively as long as the digest property is not violated.
We start with the left-most centroid, take its cluster as the current
candidate cluster, and compute its boundary value $q_{limit}$ by (5.8).
Then, sequentially processing all centroids from $X$, we estimate their
approximate quantile values and compare them to the boundary value. If
absorbing the pending centroid does not exceed the boundary value, we
merge it into the candidate cluster and continue with the next centroid
from the sequence $X$. Otherwise, the maximal capacity of the candidate
cluster is reached and no new elements can be added, so we persist
the current candidate cluster in the t-digest data structure, emit
a new candidate cluster with the pending centroid, and recompute
the quantile boundary value. At the end, we receive a fully-merged
t-digest data structure.



Algorithm 5.9: Merging elements to t-digest

Input: Buffer B with elements {(x_1, x_1^{count}), (x_2, x_2^{count}), ..., (x_b, x_b^{count})}
Input: t-digest data structure T-DIGEST
Input: Compression parameter σ > 1, scale function k
Output: t-digest data structure with merged buffer

X ← sort(T-DIGEST ∪ B)
T-DIGEST ← ∅
m ← count(X), n ← Σ_{x_i ∈ X} x_i^{count}
c ← x_1, q_c ← 0
q_limit ← k^{-1}(k(q_c, σ) + 1, σ)
for i ← 2 to m do
    q ← q_c + (1/n)·c^{count} + (1/n)·x_i^{count}
    if q ≤ q_limit then
        c ← (c·c^{count} + x_i·x_i^{count}) / (c^{count} + x_i^{count})
        c^{count} ← c^{count} + x_i^{count}
        continue
    T-DIGEST ← T-DIGEST ∪ {(c, c^{count})}
    q_c ← q_c + (1/n)·c^{count}
    c ← x_i
    q_limit ← k^{-1}(k(q_c, σ) + 1, σ)
T-DIGEST ← T-DIGEST ∪ {(c, c^{count})}    // last cluster usually is a singleton
return T-DIGEST
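A minimal Python sketch of the merging step may help to follow
the single pass. It rests on a few assumptions of ours: the digest and
the buffer are plain lists of `(value, count)` pairs, the boundary is
computed by applying (5.9) literally, "does not exceed" is read as `<=`,
and an absorbed point moves the centroid to the weighted mean. The names
`q_limit` and `merge` are illustrative and not part of any library.

```python
import math


def q_limit(q, sigma):
    """Boundary (5.9) for the scale function (5.5), applied literally."""
    return 0.5 * (1 + math.sin(math.asin(2 * q - 1) + 2 * math.pi / sigma))


def merge(digest, buffer, sigma):
    """Merge weighted points from `buffer` into `digest`; returns a new list."""
    # X <- sort(T-DIGEST U B); the sort is stable, so centroids that are
    # already in the digest stay in front of equal buffer values
    points = sorted(digest + buffer, key=lambda p: p[0])
    if not points:
        return []
    n = sum(count for _, count in points)      # total number of elements
    merged = []
    c, c_count = points[0]                     # current candidate cluster
    q_c = 0.0                                  # quantile mass already persisted
    limit = q_limit(q_c, sigma)
    for x, x_count in points[1:]:
        q = q_c + (c_count + x_count) / n      # maximal quantile if x is absorbed
        if q <= limit:
            # absorb x: weighted mean of the centroids, then add the counts
            c = (c * c_count + x * x_count) / (c_count + x_count)
            c_count += x_count
            continue
        merged.append((c, c_count))            # candidate is full, persist it
        q_c += c_count / n
        c, c_count = x, x_count                # pending point starts a new candidate
        limit = q_limit(q_c, sigma)
    merged.append((c, c_count))                # last candidate cluster
    return merged


first = merge([], [(float(v), 1) for v in (0, 0, 3, 4, 1, 6, 0, 5, 2, 0)], sigma=5)
print(first)   # [(0.0, 3), (2.0, 5), (5.0, 1), (6.0, 1)]
```

Calling `merge(first, second_buffer, sigma=5)` with the next ten unit-weight
elements of a stream folds them into the same structure, which is exactly
how the buffer-and-merge scheme reuses this routine.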


As we can see, every time a new candidate cluster is emitted,
the boundary value $q_{limit}$ has to be recomputed, which involves
an expensive computation of the scale function and its inverse, according
to (5.8) and (5.9). Luckily, the number of clusters is not too large in
practice, and various techniques have been suggested to optimize
the boundary estimation, such as using efficient approximations for
components of scale functions or roughly estimating the maximal number of
elements that can be summarized into each cluster.

The complete algorithm to index a continuous data stream is based on
the idea of processing streaming data in buffers of some fixed size and
continuously merging them into the t-digest data structure.

Algorithm 5.10: Stream processing with t-digest

Input: Data stream D = {x_1, x_2, ...}
Input: Buffer size b, compression parameter σ > 1, scale function k
Output: t-digest data structure

T-DIGEST ← ∅
while D do
    B ← {(x_1, 1), (x_2, 1), ..., (x_b, 1)}
    T-DIGEST ← Merge(T-DIGEST, B, σ, k)
return T-DIGEST


Note, the runtime costs of the buffer-and-merge Algorithm 5.10 are 
shared between frequent inserts of input elements into the buffer and rare 
calls of Algorithm 5.9. Since the inserts are cheap, the overall costs are 
dominated by the sort and the scale function invocations in the merging 
sub-algorithm that are amortized over several insertions. 
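The buffering loop itself is small, as the following Python sketch
suggests. It is a hedged illustration: `index_stream` is a name of our
choosing, the merge step is injected as a callable so that any
implementation of Algorithm 5.9 can be plugged in, and `keep_all` is
a deliberately trivial stand-in that performs no compression and only makes
the example runnable.

```python
from itertools import islice


def index_stream(stream, buffer_size, merge, digest=None):
    """Buffer-and-merge loop in the spirit of Algorithm 5.10.

    `merge(digest, buffer)` folds a list of (value, count) pairs into
    the digest and returns the new digest.
    """
    digest = [] if digest is None else digest
    it = iter(stream)
    while True:
        # cheap step: collect up to `buffer_size` unit-weight elements
        buffer = [(x, 1) for x in islice(it, buffer_size)]
        if not buffer:                      # stream exhausted
            break
        digest = merge(digest, buffer)      # rare, comparatively expensive step
    return digest


calls = []


def keep_all(digest, buffer):
    """Stand-in merge: keeps every point, so it does NOT compress anything."""
    calls.append(len(buffer))
    return sorted(digest + buffer)


result = index_stream(range(25), buffer_size=10, merge=keep_all)
print(calls)   # [10, 10, 5]
```

The last buffer may be only partially filled; the sketch still merges it so
that no element of the stream is lost.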


Example 5.9: Indexing data stream with t-digest 

Consider the dataset of n = 20 integers from Example 5.6: 

{ 0 , 0 , 3 , 4 , 1 , 6 , 0 , 5 , 2 , 0 , 3 , 3 , 2 , 3 , 0 , 2 , 5 , 0 , 3 , 1 }. 


As an example, let’s take the compression parameter $\sigma = 5$, the buffer size
$b = 10$, and the scale function as given by (5.5).


We populate the buffer B with the first ten elements from the input: 

B = ((0,1), (0,1), (3,1), (4,1), (1,1), (6,1), (0,1), (5,1), (2,1), (0,1)). 

According to Algorithm 5.10, we need to join elements from the buffer and 
the centroids that are in the t-digest. However, since the t-digest data 
structure is empty, the list of candidate centroids X contains only n = 10 
elements from B, which we sort in ascending order: 

X = ((0,1), (0,1), (0,1), (0,1), (1,1), (2,1), (3,1), (4,1), (5,1), (6,1)).

We select our first candidate cluster by taking the left-most centroid (0,1)
and compute its quantile boundary value $q_{limit}$ using the formula (5.9):

$$q_{limit} = \frac{1}{2}\left[1 + \sin\left(\arcsin(-1) + \frac{2\pi}{5}\right)\right] = 0.34549.$$

Then, we take the next element from X, which is again (0,1), and evaluate
whether it can be merged into the candidate cluster without violating
the digest property. In our case this requires estimating the maximal
quantile value q of this merged cluster:

$$q = \frac{1 + 1}{10} = 0.2,$$

and because it is below the quantile boundary value we can freely
merge the element (0,1) into the current candidate cluster, which now
has $c^{count} = 2$ elements but, since the elements are identical,
the centroid stays the same $c = 0$.

Similarly, we are able to merge the next element (0,1) into the candidate
cluster, which changes only the number of summarized elements to
$c^{count} = 3$.

Next, we get the fourth element from X, which is (0,1), and follow the same
procedure as above to check if it can also be merged into the candidate
cluster. However, the estimated maximal quantile q of such a merged cluster
would become

$$q = \frac{3 + 1}{10} = 0.4,$$

which exceeds the current boundary value $q_{limit} = 0.34549$. Thus, we
stop our attempts to absorb other centroids into the candidate cluster,
store it in the t-digest data structure

T-DIGEST = ⟨(0,3)⟩,

and remember its maximal quantile value as $q = \frac{3}{10} = 0.3$.


From this moment we start to build a new candidate cluster from
the current pending element (0,1), having $c = 0$ and $c^{count} = 1$, and
its quantile boundary value is

$$q_{limit} = \frac{1}{2}\left[1 + \sin\left(\arcsin(2 \cdot 0.3 - 1) + \frac{2\pi}{5}\right)\right] = 0.874025.$$

Next, we take element (1,1) and if we merge it into the current candidate
cluster, the estimated maximal quantile q will be

$$q = 0.3 + \frac{1 + 1}{10} = 0.5,$$

which doesn’t exceed the current boundary value. Thus, we merge
the element (1,1) into the current cluster, whose count increases to
$c^{count} = 2$ and the centroid becomes

$$c = \frac{0 \cdot 1 + 1 \cdot 1}{1 + 1} = 0.5.$$

Similarly, we process all the rest of the elements from X and the T-DIGEST
data structure grows into:

T-DIGEST = ⟨(0,3), (2,5), (5,1), (6,1)⟩.

Continuing to process the dataset, we fill a new buffer:

B = ((3,1), (3,1), (2,1), (3,1), (0,1), (2,1), (5,1), (0,1), (3,1), (1,1)), 

join it with the T-DIGEST data structure, and sort the resulting sequence 
X in ascending order of centroids: 

X = ((0,3), (0,1), (0,1), (1,1), (2,5), (2,1), (2,1), 

(3,1), (3,1), (3,1), (3,1), (5,1), (5,1), (6,1)). 

Having that, we flush the T-DIGEST data structure and, starting from
the left-most element, attempt to merge the elements sequentially, storing
into T-DIGEST the clusters that cannot be merged any further.

At the end, the resulting T-DIGEST data structure consists of m = 5
clusters and has the following view:

T-DIGEST = ⟨(0.1667,6), (2.36364,11), (5,1), (5,1), (6,1)⟩.


Because we lost information about the exact elements clustered together
(except the singleton clusters, where the centroid is the initial element),
the T-DIGEST data structure provides a lossy representation of the data
stream. To estimate quantiles and answer the Quantile query, we need
to interpolate, taking into account the distribution of clusters
that has been produced by the chosen scale function.

Algorithm 5.11: Answering Quantile queries with t-digest

Input: t-digest data structure with m clusters
Input: Value q ∈ [0, 1]
Output: q-quantile

n ← Σ_{j=1}^{m} c_j^{count}
if n·q < 1 then
    return c_1
if n·q > n − (1/2)·c_m^{count} then
    return c_m
/* at this point, we can be sure that ∃i ∈ [1, m): q(c_i) + (1/(2n))·c_{i+1}^{count} > q */
/* so the searched quantile is somewhere between c_i and c_{i+1} */
if c_i^{count} = 1 and q(c_i) > q then
    return c_i
if c_{i+1}^{count} = 1 and q(c_{i+1}) − 1/n ≤ q then
    return c_{i+1}
Δ_left ← (c_i^{count} = 1) ? 1 : 0
Δ_right ← (c_{i+1}^{count} = 1) ? 1 : 0
w_left ← n·q − n·q(c_i) + (c_i^{count} − Δ_left)/2
w_right ← n·q(c_i) − n·q + (c_{i+1}^{count} − Δ_right)/2
return (c_i·w_right + c_{i+1}·w_left) / (w_left + w_right)


Thus, to find the q-quantile from the t-digest data structure, we
calculate the rank of the searched element x in the sorted sequence,
which is n·q, where n is the total number of elements summarized into
the t-digest data structure. If this rank is below one, we report
the centroid $c_1$ as the quantile. Similarly, if the rank is within half
of the count of the last cluster from the maximal rank, or even above, we
return $c_m$, that is the maximal element in the digest. Otherwise, we
search for clusters $C_i$ and $C_{i+1}$ whose estimated quantile values
given by formula (5.4) encircle the given quantile value q. When the left
cluster $C_i$ is a singleton and its maximal quantile surpasses q, we
return centroid $c_i$ as the quantile, since it is the element that was
actually summarized into the cluster. Similarly, if the right cluster
$C_{i+1}$ is a singleton and its estimated minimal quantile value falls
behind the given quantile value q, meaning this cluster is responsible for
the searched quantile, we report centroid $c_{i+1}$ as the best estimation
of it. Otherwise, we compute weights by evaluating the contribution of each
such cluster and build an interpolation by taking the weighted average
of the centroids from both clusters, which is reported as the q-quantile.
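The query procedure described above can be sketched in Python as follows.
This is an illustration under our own assumptions rather than a reference
implementation: the digest is a list of `(centroid, count)` pairs,
the right-hand singleton test uses the minimal quantile
$q(c_{i+1}) - 1/n$ mentioned in the text, and the fallback to the last pair
of clusters (when no pair satisfies the search condition strictly) is our
addition. The sample digest is the one with five clusters built in
Example 5.9.

```python
def quantile(digest, q):
    """Estimate the q-quantile from a list of (centroid, count) pairs."""
    n = sum(count for _, count in digest)
    m = len(digest)
    if n * q < 1 or m == 1:
        return digest[0][0]
    if n * q > n - digest[-1][1] / 2:
        return digest[-1][0]

    # first pair (C_i, C_{i+1}) with q(c_i) + c_{i+1}^count / (2n) > q;
    # if the loop never breaks, i ends at the last pair of clusters
    cum = 0
    for i in range(m - 1):
        cum += digest[i][1]                 # cum == n * q(c_i)
        if cum / n + digest[i + 1][1] / (2 * n) > q:
            break

    (c_left, n_left), (c_right, n_right) = digest[i], digest[i + 1]
    q_left = cum / n                        # q(c_i)
    q_right = (cum + n_right) / n           # q(c_{i+1})
    if n_left == 1 and q_left > q:
        return c_left
    if n_right == 1 and q_right - 1 / n <= q:
        return c_right

    d_left = 1 if n_left == 1 else 0
    d_right = 1 if n_right == 1 else 0
    w_left = n * q - n * q_left + (n_left - d_left) / 2
    w_right = n * q_left - n * q + (n_right - d_right) / 2
    return (c_left * w_right + c_right * w_left) / (w_left + w_right)


digest = [(0.1667, 6), (2.36364, 11), (5, 1), (5, 1), (6, 1)]
print(round(quantile(digest, 0.65), 2))   # 3.08
```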


The quantile estimation algorithm is dependent on the choice of the scale 
function and with more aggressive functions that produce a bigger tail of 
singleton clusters at the edges, it can be tuned to improve accuracy for 
extreme quantiles [Du18]. Additionally, it is advised to persist the minimal 
and maximal elements during the indexing to use them in interpolation. 


Example 5.10: Quantile query with t-digest

We perform the Quantile query to calculate the 0.65-quantile from
the T-DIGEST data structure built in Example 5.9:

T-DIGEST = ⟨(0.1667,6), (2.36364,11), (5,1), (5,1), (6,1)⟩.

The total number of elements in the T-DIGEST is the sum of counts from all
clusters, that is n = 20 in our case. The rank of the searched quantile x is
n·q = 20·0.65 = 13, which is neither smaller than one, nor too close to n,
therefore, we start searching for two consecutive clusters $C_i$ and
$C_{i+1}$, whose centroids will encircle the searched quantile x. In
the current T-DIGEST data structure, these are $C_2$ and $C_3$, because

$$q(c_2) + \frac{1}{2 \cdot 20} c_3^{count} = \frac{6 + 11}{20} + \frac{1}{40} = 0.875 > 0.65,$$

as required according to Algorithm 5.11.

Since the cluster $C_3$ is a singleton, meaning it has $c_3^{count} = 1$,
we need to compare its maximal quantile value to the searched quantile
value q in order to check if its centroid can be the best fit. Thus, we
calculate

$$q(c_3) + \frac{1}{20} = \frac{6 + 11 + 1}{20} + \frac{1}{20} = 0.95,$$

that significantly surpasses the value q = 0.65 and we conclude that
the actual quantile is located somewhere in between the centroids $c_2$
and $c_3$, but remember that the right cluster is a singleton by setting
$\Delta_{right} = 1$.

Therefore, the searched quantile can be estimated by the interpolation
using the weighted average of the centroids, where the weights are

$$w_{left} = 20 \cdot 0.65 - 20 \cdot q(c_2) + \frac{c_2^{count} - 0}{2} = 13 - 20 \cdot \frac{6 + 11}{20} + \frac{11}{2} = 1.5,$$

$$w_{right} = 20 \cdot q(c_2) - 20 \cdot 0.65 + \frac{c_3^{count} - 1}{2} = 20 \cdot \frac{6 + 11}{20} - 13 + \frac{1 - 1}{2} = 4.$$

Finally, the estimated 0.65-quantile for the dataset of Example 5.9 is

$$x = \frac{c_2 \cdot w_{right} + c_3 \cdot w_{left}}{w_{left} + w_{right}} = \frac{2.36364 \cdot 4 + 5 \cdot 1.5}{1.5 + 4} = 3.08,$$

which is pretty close to the exact value of 3 for that dataset.

Similar to the Quantile query, we can use the t-digest data structure
to answer the Inverse quantile query and find the rank of some given
element x. We start with the comparison of the element x to the minimal
and maximal centroids in the t-digest data structure, which are
the left-most and right-most clusters, respectively. If x falls outside
that range, we just report either 1 or the total number of elements n as
the estimated rank value, depending on the side on which the element
appears. Otherwise, we search for the element x among the centroids and,
if such clusters are found, we accumulate their counts and report rank(x)
as the rank of the cluster with the smallest index adjusted by that
amount. If none of the checks above succeeds, we can be sure that
the element x falls in between the centroids of some consecutive clusters,
say $(c_i, c_{i+1})$, and its rank is already at least $n \cdot q(c_i)$.
If both of these clusters are singletons, meaning their centroids are
exactly the input elements that were summarized, we do not need to correct
that value and return it as the searched rank. When only one of those
clusters is a singleton, we fine-tune the rank by the scaled contribution
of the other cluster to get the final value. Otherwise, we build
an interpolation using both cluster sizes and adjust the guaranteed rank
value.

Algorithm 5.12: Answering Inverse quantile queries with t-digest

Input: Element x
Input: t-digest data structure with m clusters
Output: Rank of element

n ← Σ_{j=1}^{m} c_j^{count}
if x < c_1 then
    return 1
if x > c_m then
    return n
/* check if x is one of the centroids */
if ∃j : c_j = x then
    J ← {j : c_j = x}, i* ← min(J)
    return n·q(c_{i*}) − c_{i*}^{count} + (1/2)·Σ_{i∈J} c_i^{count}
/* at this point, we can be sure that ∃i ∈ [1, m): x ∈ (c_i, c_{i+1}) */
rank ← n·q(c_i)
if c_i^{count} = 1 and c_{i+1}^{count} = 1 then
    return rank


if C) count = 1 then 
j_ return rank + § • c?°" nt 

if c£pj nt = 1 then 

return rank - ■ c£ ount 

return rank + | • c£°“ nt - • c? ount 

As mentioned earlier, to answer the Range query, it is simple enough 
to perform two Inverse quantile queries and find the difference between 
the ranks of the range borders. 


Properties 

There is a clear trade-off between the size of the t-digest data structure,
as controlled by the compression parameter σ, the speed, and the accuracy
to which the quantiles are estimated. Thus, with a smaller value of σ
and a large buffer size b, we can achieve higher speed with constant
memory usage. For the highest accuracy, it is preferable to use a larger σ
to have less compression and a bigger buffer (e.g., 10 × σ), while for
the smallest memory footprint it is preferable to use a smaller buffer and
stronger compression, that is a smaller value of σ.

As shown by the t-digest authors, when using the scale function (5.5),
the number of clusters m in the t-digest data structure that satisfies
the digest property (5.7) and has indexed $n \ge \frac{\sigma}{2}$ elements
is in the range of

$$\frac{\sigma}{2} \le m \le \lceil \sigma \rceil. \tag{5.10}$$


Example 5.11: Estimate required space 

For example, we want to index at least n = 1000 elements with
the compression parameter σ = 100. Therefore, according to (5.10), we
can expect from 50 to 100 clusters in the fully-merged t-digest.

In the t-digest data structure each cluster is represented by its centroid
and the number of indexed elements. Thus, having 32-bit counters and
a double precision 64-bit floating point number for the centroid value,
each cluster requires 12 bytes of memory and the whole data structure
fits in about 1.2 KB of memory.

For high accuracy we typically use a buffer ten times bigger than
the compression parameter, and having b = 10·σ = 1000 we can allocate
smaller 16-bit counters and double precision 64-bit floating point numbers
for elements in the buffer, which adds up to an additional 10 KB of memory
at runtime.
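The arithmetic of this example can be reproduced with a short Python
sketch. The helper names and the default field sizes are our assumptions,
chosen to mirror the numbers in the text: 4-byte counters and 8-byte
centroids in the digest, 2-byte counters and 8-byte values in the buffer.

```python
import math


def tdigest_size_bytes(sigma, counter_bytes=4, centroid_bytes=8):
    """Upper bound on the digest size: at most ceil(sigma) clusters by (5.10)."""
    return math.ceil(sigma) * (counter_bytes + centroid_bytes)


def buffer_size_bytes(b, counter_bytes=2, value_bytes=8):
    """Memory needed for b buffered (value, count) pairs."""
    return b * (counter_bytes + value_bytes)


sigma = 100
print(tdigest_size_bytes(sigma))       # 1200 bytes, about 1.2 KB
print(buffer_size_bytes(10 * sigma))   # 10000 bytes, about 10 KB
```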


The t-digest algorithm maintains accuracy ε in q-quantile estimation
that is proportional to q·(1 − q) and, in contrast to other algorithms
which maintain only a constant absolute error, in the t-digest the relative
error is bounded, which makes it resistant to significant errors for
extreme quantiles. The advantage of the t-digest over the q-digest is also
that it can handle floating point values while the q-digest, as we have
already seen, is limited to integers only.

Two t-digest data structures can be easily merged using the same
algorithm, but the resulting data structure is not the same as
the t-digest built for the joined input stream. However, empirical
results show that it provides a good estimation of it, so it is
possible to build t-digests for different parts of the data stream in
parallel and combine them to answer rank queries. This makes
the algorithm parallel-friendly and useful in MapReduce and stream
mining tasks for Big Data applications.

The t-digest algorithm has become increasingly popular. For instance, it
is used in the percentiles aggregation in Elasticsearch and is also
available in stream-lib and Apache Mahout.


Conclusion 

In this chapter we covered efficient algorithms and data structures that 
are widely used to calculate rank-based characteristics of the data using 
a small amount of memory. We studied a popular sampling algorithm,
a well-known tree-based stream summary algorithm, as well as its modern
alternative that is based on one-dimensional clustering. With these
algorithms we can find ranks of elements in a data stream, estimate
various quantiles, and execute range queries.

If you are interested in more information about the material covered 
here or want to read the original papers, please take a look at the list of 
references that follows this chapter. 

In the next chapter we consider the similarity problem, one of 
the fundamentals in data analysis. We study different similarity 
definitions and efficient probabilistic algorithms that approach 
the problem of ascertaining the most similar documents for a given 
document across huge datasets. 


6 

Similarity 


Similarity is a fundamental data analysis problem that has attracted a lot
of research effort in the last two decades. While talking about relations
of two documents¹, we are mostly interested in concepts such as “roughly
the same” and in finding a way to express similarity numerically.

Similarity plays an important role for Big Data applications and
can be used to reduce the processing time and computation efforts. For
instance, with its help, we can eliminate data that has already been
processed even if it doesn’t have the same form as before. Another
example is the development of different sampling techniques to handle
large volumes of data that are sometimes infeasible to process in full.
When handling data from a number of classes, instead of just taking every
n-th document from the dataset (which can result in unbalanced processing
of the classes), we can develop a similarity measure to group documents
of one class together and process equal subsets from each class to keep
the processing balanced.

¹ “Documents” can be objects of any nature, e.g., texts, images, etc.


Example 6.1: DNA sequences (Xie et al., 2015) 

The rapid development of DNA sequencing technologies in recent years 
has led to a huge number of discovered DNA sequences. Evaluation 
of the similarity between them is a crucial starting point for analyzing 
genomic information and has a wide range of applications. However, DNA 
databases have a huge number of documents, where the same data can 
be stored in various different forms and an efficient search for similar 
sequences is essential. 


The most well-known similarity-related problem is to find a nearest 
neighbor for a given document, meaning the document that is most similar 
to it across the dataset. Having an efficient algorithm for the nearest 
neighbor search in a large database can speed up, by several orders of 
magnitude, many important applications like document retrieval, image 
matching, etc. 

The naive solution is to use a linear scan: iterate over all existing
documents and compare them to the given document. Such an approach
guarantees to find the exact nearest neighbor of any query object, but
requires O(n) time, where the number of documents n is huge. In
high-dimensional spaces, the problem of the nearest neighbor search
becomes even more difficult.

Thus, we are looking for sublinear time solutions that find the nearest
neighbor approximately, which is suitable in most practical cases. In
practice, we are interested in solving an approximate nearest neighbor
problem or, more formally, an ε-Nearest neighbor problem: to find, with
some high probability 1 − ε, the nearest neighbor for a given document in
a large database.

The immediate application of the nearest neighbor search is
the detection of duplicates (exact and non-exact), a task to find
documents that at some level are similar to the given document.


Example 6.2: Intellectual property (Broder et al., 1997) 

The detection of duplicates, illegal copies, or modifications is very
important in intellectual property protection and plagiarism prevention.

Given a source document, we can perform a nearest neighbor search to
find other documents that are similar to it, in whole or in part, that
have been substantially copied or slightly edited.


Another important application of the nearest neighbor problem is 
clustering, a task of grouping documents in a way that documents in 
the group (cluster) are more similar to each other than to other documents 
outside the group, in other words, to group the nearest documents 
together. 

Conceptually, to find similar documents in a dataset, we need to
compare each document to every other document, which requires
the evaluation of a quadratic number of pairs. Thus, for 1 million
documents there are about 500 billion ($5 \cdot 10^{11}$) pairs and,
judging $10^6$ pairs per second, it takes almost six days to process all
those documents, which is impractical.
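The estimate is easy to re-derive. The following Python lines are
a back-of-the-envelope check only; the function name `all_pairs_days` is
ours, and the processing rate is the one assumed in the text.

```python
def all_pairs_days(documents, pairs_per_second):
    """Number of unordered document pairs and the days needed to judge them."""
    pairs = documents * (documents - 1) // 2        # n choose 2
    return pairs, pairs / pairs_per_second / 86_400  # 86,400 seconds per day


pairs, days = all_pairs_days(1_000_000, 10**6)
print(pairs)            # 499999500000, i.e. about 5 * 10**11
print(round(days, 2))   # 5.79
```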

Since the similarity problem itself is fuzzy, it is natural to use 
probabilistic algorithms to solve it fast and efficiently. 

Jaccard (resemblance) similarity 

While it is not immediately clear how to express the similarity between
documents of arbitrary representation, mathematics has already
developed a solid theory for set similarity. Thus, representing documents
as collections of some features, the document similarity problem can be
converted to a set intersection problem and evaluated, for instance, by
a random sampling that can be done independently for each document.

There are many different ways to represent a document of any nature 
as a set. Generally speaking, we need to identify important document 
characteristics that describe it in the best way and represent the document 
as a simple collection of those features. To be able to compare documents 
with higher effectiveness, it is important to define a canonical collection of 
features which stay the same for documents that differ only in information 
that is usually ignored as meaningless (e.g., for text documents, we often 
ignore punctuation, capitalizations, formatting, and so on). The step of 
the preprocessing of documents to their canonical form is called document 
normalization. 


Example 6.3: Features for music tracks 

In the task of finding audio matches, we want to use features that are 
robust to the common types of abuse that are performed on audio before 
it reaches our ears. For instance, we can note the peaks in the spectrum 
and encode their positions in time and space as a collection of signatures 
that describe the particular audio. 

In contrast, for songs, we can extract features based on mel-frequency 
cepstral coefficients (MFCCs), which are a short-time spectral 
decomposition of a musical clip that conveys the general frequency 
characteristics important to human hearing. Representing a song as 
a collection of MFCC frames, we can consider two songs similar if they 
have the same frames regardless of the order. 


Another example shows how text documents can be treated. 


Example 6.4: Shingling technique for text documents 

For text documents, the most well-known method to represent them as
collections of features is shingling, where a shingle is a contiguous
subsequence contained in a document. Specifically, every document can be
associated with a collection of w-shingles, which includes all shingles of
some predefined size w contained in the document.

For example, consider a text document “The quick brown fox jumps over
the lazy dog”. We can build shingles of size w = 6 from the sequence of
characters, which are

“the qu”, “he qui”, “e quic”, “ quick”, “quick ”, “uick b”, “ick br”,
“ck bro”, “k brow”, and so on.

Another approach is to use word tokenization, which for our example can be
reduced to a simple split of the document by spaces, and we build shingles
from the sequence of words. For example, 3-shingles (3-grams) will be

“the quick brown”, “quick brown fox”, “brown fox jumps”, “fox jumps over”,
“jumps over the”, “over the lazy”, “the lazy dog”.

Unfortunately, the length of the shingles can vary by a wide range, and it
can be tough to allocate a space-efficient data structure.

Instead, we can convert shingles to fixed-length entities by applying
a classical hash function that hashes to the desired number of bits, e.g.,
8-bit values. This approach has some additional tiny probability of
collision, but can drastically reduce the required space.
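Both kinds of shingles, and their reduction to fixed-length values, can be
sketched in Python. The function names are illustrative, lower-casing is
used as a very simple normalization, and the choice of BLAKE2b truncated to
`bits` bits is our assumption; any well-mixing hash function would serve
the same purpose.

```python
import hashlib


def char_shingles(text, w):
    """All contiguous character subsequences of length w."""
    return [text[i:i + w] for i in range(len(text) - w + 1)]


def word_shingles(text, w):
    """All contiguous runs of w whitespace-separated words."""
    words = text.split()
    return [" ".join(words[i:i + w]) for i in range(len(words) - w + 1)]


def hashed(shingles, bits=32):
    """Set of fixed-length integer fingerprints, `bits` bits each."""
    mask = (1 << bits) - 1
    return {
        int.from_bytes(
            hashlib.blake2b(s.encode("utf-8"), digest_size=8).digest(), "big"
        ) & mask
        for s in shingles
    }


doc = "The quick brown fox jumps over the lazy dog".lower()
print(char_shingles(doc, 6)[:4])    # ['the qu', 'he qui', 'e quic', ' quick']
print(len(word_shingles(doc, 3)))   # 7
fingerprints = hashed(word_shingles(doc, 3), bits=8)   # values in 0..255
```

With only 8 bits there are just 256 possible fingerprints, so collisions
become likely as soon as a document has more than a handful of shingles;
wider fingerprints trade space for fewer collisions.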


If two documents $d_A$ and $d_B$ are represented as collections of
features, we can mathematically calculate their resemblance as the Jaccard
similarity $J(d_A, d_B)$, which indicates the ratio of common features in
both documents and produces a number between zero and one, such that it is
close to one for documents that are roughly the same:

$$J(d_A, d_B) = \frac{|d_A \cap d_B|}{|d_A \cup d_B|}. \tag{6.1}$$

The Jaccard similarity of exact duplicates is equal to one, and we can
consider documents as nearest neighbors if their resemblance exceeds
a certain given threshold $0 < \theta < 1$.


In reality, with high volumes of documents to compute the Jaccard
similarity for, it suffices to keep a relatively small fixed-size sketch
for each document. Such sketches can be produced very fast (linear in
document size) and, given two sketches, the Jaccard similarity can be
computed in linear time based on the size of the sketches.


Example 6.5: Jaccard similarity

Medical symptoms can be naturally used as features for diseases. Consider
five well-known illnesses together with their most common symptoms²:

      Disease             Symptoms
d_1   allergic rhinitis   sneezing, itchiness, runny nose
d_2   common cold         runny nose, sore throat, headache, muscle
                          aches, cough, sneezing, fever, loss of taste
d_3   flu                 fever, aching body, feeling tired, cough, sore
                          throat, headache, difficulty sleeping, loss of
                          appetite, diarrhea, nausea
d_4   measles             runny nose, cough, red eyes, fever, greyish-
                          white spots, rash
d_5   roseola             fever, runny nose, cough, diarrhea, loss of
                          appetite, swollen glands, rash

Intuitively, we can expect that the common cold is a bit more similar to
the flu than to roseola; roseola should be similar to measles, and allergic
rhinitis should be quite different from the others. Let’s compute Jaccard 
similarities for these documents. 

Documents $d_2$ and $d_3$ have 14 different symptoms in total while sharing
only 4 of them (cough, fever, headache, sore throat); thus, the similarity
is equal to 0.2857, which is about 29%:

$$J(d_2, d_3) = \frac{4}{14} = 0.2857.$$

Next, we compare documents $d_4$ and $d_5$ that have 9 different symptoms
in total and 4 in common, so the similarity is 44%:

$$J(d_4, d_5) = \frac{4}{9} = 0.44.$$

Comparing $d_1$ to $d_3$ gives us no common symptoms, so $J(d_1, d_3) = 0$
and they are two different diseases that cannot be accidentally mixed up.
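The three similarities can be checked with Python sets. The sketch below is
a direct transcription of (6.1); the function name `jaccard` and
the convention of returning 1.0 for two empty collections are our own
choices.

```python
def jaccard(a, b):
    """Jaccard similarity (6.1) of two feature collections."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0          # convention chosen here for two empty sets
    return len(a & b) / len(a | b)


diseases = {
    "allergic rhinitis": {"sneezing", "itchiness", "runny nose"},
    "common cold": {"runny nose", "sore throat", "headache", "muscle aches",
                    "cough", "sneezing", "fever", "loss of taste"},
    "flu": {"fever", "aching body", "feeling tired", "cough", "sore throat",
            "headache", "difficulty sleeping", "loss of appetite",
            "diarrhea", "nausea"},
    "measles": {"runny nose", "cough", "red eyes", "fever",
                "greyish-white spots", "rash"},
    "roseola": {"fever", "runny nose", "cough", "diarrhea",
                "loss of appetite", "swollen glands", "rash"},
}

print(round(jaccard(diseases["common cold"], diseases["flu"]), 4))   # 0.2857
print(round(jaccard(diseases["measles"], diseases["roseola"]), 2))   # 0.44
print(jaccard(diseases["allergic rhinitis"], diseases["flu"]))       # 0.0
```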


Once every document is represented as a collection of features, we
have the set of all features across all documents, which is called
the universal set Ω. With respect to it, the feature set of a document can
be seen as a bit-array, where set bits indicate that the corresponding
feature from the universal set is present in the document.

² Find more conditions and treatments at NHS Choices https://www.nhs.uk


The universal set is usually much bigger than the collection of features
from a particular document, therefore the document bit-arrays have many
more unset bits than set ones (they are very sparse).


Example 6.6: Document bit-array 

Consider the list of diseases from Example 6.5. The universal set for these 
documents includes all the different symptoms mentioned in the documents 
(in practice, it should consist of all possible medical symptoms). We can 
enumerate those symptoms in some particular order, e.g., alphabetically:

Index  Symptom                 Index  Symptom
0      aching body             10     loss of taste
1      cough                   11     muscle aches
2      diarrhea                12     nausea
3      difficulty sleeping     13     rash
4      feeling tired           14     red eyes
5      fever                   15     runny nose
6      greyish-white spots     16     sneezing
7      headache                17     sore throat
8      itchiness               18     swollen glands
9      loss of appetite

The bit-array that corresponds to document $d_3$ (flu) and the order of
features has the following form:

Index:  0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18
Bit:    1  1  1  1  1  1  0  1  0  1  0  0  1  0  0  0  0  1  0


Thus, the set bit in position 5 (corresponding to fever) means that it is 
a symptom for the flu, while the unset bit in position 13 indicates that 
rash is not a symptom. 


The Jaccard similarity between two document bit-arrays is the ratio
of the number of bits that are set for both documents (i.e., have
ones in the same bit-positions) to the number of bits that are set for
either one or the other document.
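A hedged Python sketch of the bit-array view is given below. It uses plain
Python integers as bitmasks, which is our choice for brevity and not
something the text prescribes; the helper names `to_bits` and
`jaccard_bits` are likewise illustrative. The universal set is built from
the five diseases of Example 6.5 and enumerated alphabetically, as in
Example 6.6.

```python
def to_bits(features, universal):
    """Bitmask with bit i set when universal[i] occurs in `features`."""
    return sum(1 << i for i, name in enumerate(universal) if name in features)


def jaccard_bits(a, b):
    """popcount(a AND b) / popcount(a OR b) for two integer bitmasks."""
    either = bin(a | b).count("1")
    return bin(a & b).count("1") / either if either else 1.0


docs = {
    "allergic rhinitis": {"sneezing", "itchiness", "runny nose"},
    "common cold": {"runny nose", "sore throat", "headache", "muscle aches",
                    "cough", "sneezing", "fever", "loss of taste"},
    "flu": {"fever", "aching body", "feeling tired", "cough", "sore throat",
            "headache", "difficulty sleeping", "loss of appetite",
            "diarrhea", "nausea"},
    "measles": {"runny nose", "cough", "red eyes", "fever",
                "greyish-white spots", "rash"},
    "roseola": {"fever", "runny nose", "cough", "diarrhea",
                "loss of appetite", "swollen glands", "rash"},
}
universal = sorted(set().union(*docs.values()))   # 19 symptoms, alphabetical
bits = {name: to_bits(symptoms, universal) for name, symptoms in docs.items()}

# bit string for the flu, position 0 first
flu_row = "".join(str((bits["flu"] >> i) & 1) for i in range(len(universal)))
print(flu_row)                                                      # 1111110101001000010
print(round(jaccard_bits(bits["common cold"], bits["flu"]), 4))     # 0.2857
```

The result agrees with the set-based computation because AND and OR on
the bitmasks correspond to intersection and union of the feature sets.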

A binary representation of the documents encodes only the fact of
the feature's existence in the document, but cannot answer questions
about how frequently the feature appears and does not support feature
prioritizing. For instance, in Example 6.5 many diseases have cough,
fever, and runny nose as symptoms because this is just the way our
body protects itself regardless of the particular illness. However, this
makes many different diseases a bit more similar to each other, and to
identify "truly" similar documents we need to use different approaches, for
instance the TF-IDF model, which prioritizes terms that are more unique
between documents and represents documents as dense vectors of the features'
weights. Unfortunately, the Jaccard similarity defined by (6.1) cannot
be applied in this case, and we need to go for other similarity definitions,
such as the Ruzicka similarity or the cosine similarity.


Cosine similarity 

Another view on the mathematical formalization of documents is to represent
them as dense vectors of weighted features, where the weights could
highlight the importance of the features.

Text documents are the main targets of such formalization due to
the popularity of the Vector Space Model 3, which provides
a representation for such documents as dense vectors of identifiers. For
instance, the term frequency - inverse document frequency model
(TF-IDF) considers documents as dense vectors of term weights, built
as a relative frequency of the term in the document (term frequency,
TF), normalized by the relative number of documents in the dataset
that contain the term (inverse document frequency, IDF).


3 G. Salton et al., A vector space model for automatic indexing, 1975

Example 6.7: Vector Space Model

Consider the documents from Example 6.5 and let's build a real-valued
representation of them using the TF-IDF model. We treat every
symptom $s_j$ as a term and compute its weight $w_j$ for the document based
on the occurrence of the term in the dataset. The idea is to prioritize
terms that occur more often in the particular document but are very rare in
the whole dataset, which could be an indicator that they better
characterize the document. In our case, all symptoms occur exactly one
or zero times in the documents, so instead of using the pure frequency, we
use the feature's frequency adjusted for document length, the relative
frequency $f_j^d$ of the symptom in the document. To make the results more
visual, we additionally scale the output and round the weights to integers:

$$w_j = 100 \cdot f_j^d \cdot \log \frac{n}{n_j},$$

where $n_j$ is the number of documents that contain feature $s_j$, $n$ is
the total number of documents in the dataset, and $\log$ is the natural
logarithm.

Like in Example 6.6, we can define a universal set $\Omega$ and enumerate all
the different symptoms alphabetically. Thus, we end up with 19 unique
features which will induce the dimensionality of our document vectors.


Feature    Symptom               Number of documents
$s_0$      aching body           1
$s_1$      cough                 4
$s_2$      diarrhea              2
$s_3$      difficulty sleeping   1
$s_4$      feeling tired         1
$s_5$      fever                 4
$s_6$      greyish-white spots   1
$s_7$      headache              2
$s_8$      itchiness             1
$s_9$      loss of appetite      2
$s_{10}$   loss of taste         1
$s_{11}$   muscle aches          1
$s_{12}$   nausea                1
$s_{13}$   rash                  2
$s_{14}$   red eyes              1
$s_{15}$   runny nose            4
$s_{16}$   sneezing              2
$s_{17}$   sore throat           2
$s_{18}$   swollen glands        1


Consider document $d_3$ (flu) and build its representation as a vector of
weights of the document features. The feature $s_0$ (aching body) is one out
of the 10 features for document $d_3$, so its relative frequency is
$f_0^{d_3} = \frac{1}{10} = 0.1$; since no other document from the dataset
contains that feature, $n_0 = 1$ and the total number of documents $n = 5$:

$$w_0 = 100 \cdot 0.1 \cdot \log \frac{5}{1} \approx 16.$$

Similarly, feature $s_1$ (cough) appears once in the document, so
$f_1^{d_3} = 0.1$, but it is contained in four documents in the dataset, so
$n_1 = 4$ and its weight is

$$w_1 = 100 \cdot 0.1 \cdot \log \frac{5}{4} \approx 2.$$

We can continue processing all features, and if some feature from
the universal set is not present in the document, its weight is equal to
zero regardless of other counts.

Thus, the final real-valued vector representation of document $d_3$ is

(16, 2, 9, 16, 16, 2, 0, 9, 0, 9, 0, 0, 16, 0, 0, 0, 0, 9, 0)
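
The weighting scheme of this example can be reproduced with a short Python sketch. It assumes the five document bit-arrays listed in Example 6.11 and the natural logarithm; the names `DOCS` and `tfidf_weights` are illustrative and not part of the original text.

```python
import math

# Bit-arrays of documents d1..d5 over the 19 alphabetically ordered symptoms
# (assumed to be the arrays listed in Example 6.11).
DOCS = [
    [0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0],
    [0, 1, 0, 0, 0, 1, 0, 1, 0, 0, 1, 1, 0, 0, 0, 1, 1, 1, 0],
    [1, 1, 1, 1, 1, 1, 0, 1, 0, 1, 0, 0, 1, 0, 0, 0, 0, 1, 0],
    [0, 1, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0],
    [0, 1, 1, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 1, 0, 1, 0, 0, 1],
]

def tfidf_weights(doc, docs):
    """Scaled, rounded weights w_j = 100 * f_j * ln(n / n_j) for one document."""
    n = len(docs)                      # total number of documents
    relative_frequency = 1 / sum(doc)  # every present symptom occurs exactly once
    weights = []
    for j, present in enumerate(doc):
        if not present:
            weights.append(0)          # absent features get zero weight
            continue
        n_j = sum(d[j] for d in docs)  # documents that contain feature j
        weights.append(round(100 * relative_frequency * math.log(n / n_j)))
    return weights

print(tfidf_weights(DOCS[2], DOCS))    # vector for d3 (flu)
```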


One of the popular similarity measures in the area of document vectors,
the cosine similarity $c(d_A, d_B)$, is the cosine of the angle
$\alpha = \alpha(d_A, d_B)$ between two documents that are represented as
two non-zero vectors:

$$c(d_A, d_B) = \cos \alpha = \frac{d_A \cdot d_B}{\|d_A\| \cdot \|d_B\|}. \qquad (6.2)$$

The cosine similarity focuses on the orientation of the document
vectors, not on their magnitude. If two document vectors are orthogonal
in the space (therefore, such documents are completely non-related),
the angle between them is 90° and the cosine similarity is cos 90° = 0.
On the other hand, if the angle between the document vectors is close to
0°, the documents are roughly the same and their cosine similarity is
close to one.


Although the cosine function can take values from [-1, 1], in most
information retrieval problems document vectors have only non-negative
components, so the angle doesn't exceed 90° and the cosine similarity only
has values from [0, 1].


Example 6.8: Cosine similarity 

Consider the RGB color space that defines the chromaticity of red (R),
green (G), and blue (B). Every supported color can be represented with
non-negative 8-bit values of R, G, and B. For instance, red has the maximum
value 255 in the R channel and zeros in other channels.

Many people can intuitively identify similar colors, but let's estimate their
cosine similarities. Consider the list of colors in the table below.



        Color           R     G     B
$d_1$   red             255   0     0
$d_2$   dark red        139   0     0
$d_3$   ruby            224   17    95
$d_4$   deep sky blue   0     191   255


Alternatively, we can draw them as position vectors in three-dimensional
RGB space.

[Figure: the colors drawn as position vectors in RGB space]

We compute the similarity between documents $d_1$ (red) and $d_2$ (dark red),
which intuitively must be very similar:


$$c(d_1, d_2) = \frac{255 \cdot 139 + 0 \cdot 0 + 0 \cdot 0}{\sqrt{255^2 + 0^2 + 0^2} \cdot \sqrt{139^2 + 0^2 + 0^2}} = 1.$$

Thus, according to the cosine similarity, the documents are exactly
the same. This is due to the fact that for cosine similarity only
the orientation is important (the fact that both of them have values only
in the R channel), but not the magnitude of the vectors (the actual value
in the channel).

Next, consider the document $d_4$ (deep sky blue), which must be in strong
contrast to $d_1$ (red), and compute the cosine similarity between them:

$$c(d_1, d_4) = \frac{255 \cdot 0 + 0 \cdot 191 + 0 \cdot 255}{\sqrt{255^2 + 0^2 + 0^2} \cdot \sqrt{0^2 + 191^2 + 255^2}} = 0.$$

Indeed, these documents are orthogonal, which is confirmed by the zero cosine
similarity.

Now, consider the document $d_3$ (ruby), which has values in all channels,
and find which color it is more similar to:

$$c(d_3, d_4) = \frac{224 \cdot 0 + 17 \cdot 191 + 95 \cdot 255}{\sqrt{224^2 + 17^2 + 95^2} \cdot \sqrt{0^2 + 191^2 + 255^2}} \approx 0.35,$$

and

$$c(d_3, d_1) = \frac{224 \cdot 255 + 17 \cdot 0 + 95 \cdot 0}{\sqrt{224^2 + 17^2 + 95^2} \cdot 255} \approx 0.92.$$

Thus, ruby is more similar to red than to deep sky blue, which is predictable
since it is a representation of the color of the cut and polished ruby gemstone
and is a shade of red.
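
The computations of this example can be checked with a small Python sketch of formula (6.2). The function name `cosine_similarity` and the variable names are illustrative; the RGB triples are the ones from the table in this example.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero vectors, as in (6.2)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

red = (255, 0, 0)
dark_red = (139, 0, 0)
ruby = (224, 17, 95)
deep_sky_blue = (0, 191, 255)

print(cosine_similarity(red, dark_red))        # same orientation
print(cosine_similarity(red, deep_sky_blue))   # orthogonal vectors
print(round(cosine_similarity(ruby, red), 2))
print(round(cosine_similarity(ruby, deep_sky_blue), 2))
```

Because only the orientation matters, scaling a vector (as with red and dark red) does not change the result.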


Now we study a generic framework for efficient searching of near- 
duplicate documents and then we go to its well-known implementations 
regarding different definitions of similarity. 















6.1 Locality-Sensitive Hashing

Locality-Sensitive Hashing (LSH) was proposed by Piotr Indyk and 
Rajeev Motwani in 1998 [In98] as a family of functions with the property 
that similar input objects (from the domain of such functions) have 
a higher probability of colliding in the range space than dissimilar ones. 

Intuitively, LSH is based on the simple idea that if two documents
of any nature are close together, then, after applying those hash functions,
the resulting hash values of these documents will remain close as well.


Locality-sensitive hash functions radically differ from conventional hash 
functions because they have the goal of maximizing the probability of 
a collision of similar items, while others try to minimize it. If we consider 
two documents that are different just by a single byte and apply any 
conventional hash function, for instance, MurmurHash3 or MD5, the hash 
values will be completely different, because the goal of those hash functions 
is to maintain a low probability of collision. 

In order to construct locality-sensitive hash functions that preserve
similarity between documents, it is necessary to know how to measure
such similarity $Sim(d_A, d_B)$ and distinguish similar objects using a certain
threshold $\theta$.


The similarity measure should be chosen based on the particular practical
problem, and different similarity measures induce different LSH function
families. However, not every similarity measure can be used to build
locality-sensitive hash functions; for instance, it has been proven that it is
impossible to construct them for such popular metrics as the Dice coefficient
and the Overlap coefficient.


The locality-sensitive hash function $h$ is a function that maps every
document from the dataset, preserving similarity between documents in
the way that the probability of collision $P$ is higher for similar documents:

$$\begin{cases} \Pr(h(d_A) = h(d_B)) > p_1, & \text{if } Sim(d_A, d_B) > \theta, \\ \Pr(h(d_A) = h(d_B)) < p_2, & \text{if } Sim(d_A, d_B) < \gamma\theta, \end{cases} \qquad (6.3)$$

where $0 < \gamma < 1$ and $0 < p_2 < p_1 < 1$.

The closer $\gamma$ is to one, the better the function and the smaller the error in
the similarity detection.



The Locality-Sensitive Hashing algorithm is a generic schema that 
solves similarity problems with the help of locality-sensitive hash functions 
that have been built for the chosen similarity measure. 

Algorithm 6.1: Locality-sensitive bucketing
Input: Dataset D = {d_1, d_2, ..., d_n}
Input: Family of LSH functions H^θ_Sim
Output: LSH hash table with documents grouped into buckets
T ← ∅
h ∼ H^θ_Sim
for d ∈ D do
    key ← h(d)
    T(key) ← T(key) ∪ {d}
return T
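
A minimal Python sketch of the bucketing step is shown here. The function `bucketing`, the dataset, and the hash function `h` are illustrative assumptions: documents are modelled as sets of integer feature indices, and `h` is an arbitrary stand-in for one function drawn from an LSH family (a MinHash-style function of the kind constructed in Section 6.2).

```python
from collections import defaultdict

def bucketing(dataset, h):
    """Group document names into buckets keyed by the hash value h(document)."""
    table = defaultdict(set)
    for name, doc in dataset.items():
        table[h(doc)].add(name)
    return table

# Hypothetical dataset: documents as sets of feature indices in 0..18.
dataset = {
    "d1": {8, 15, 16},
    "d2": {1, 5, 7, 10, 11, 15, 16, 17},
    "d3": {0, 1, 2, 3, 4, 5, 7, 9, 12, 17},
    "d4": {1, 5, 6, 13, 14, 15},
    "d5": {1, 2, 5, 9, 13, 15, 18},
}

# Stand-in LSH function: the smallest hashed feature index of the document.
h = lambda doc: min(((22 * x + 5) % 31) % 19 for x in doc)

table = bucketing(dataset, h)
print(dict(table))
```

With this particular function, `d2`, `d4`, and `d5` share a bucket, so only they would be compared with each other at the exact-similarity step.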


The simple idea is to map documents, using locality-sensitive hash
functions, to a limited number of buckets where similar documents
appear in the same bucket with a higher probability. Such buckets can
be organized in a hash table, where each of them is indexed by its hash
value, and we can search for near documents via a hash table lookup.

However, as we see from (6.3), locality-sensitive hash functions are not
exact, which means false positive and false negative events can occur.

False positive events in this generic schema occur when two dissimilar
documents (whose similarity measure does not exceed the threshold $\theta$)
appear in the same bucket. This type of error can be eliminated by
calculating the exact similarities for documents in the bucket and
comparing them to the given threshold.

More difficult are false negatives, when two similar documents end
up in different buckets. This cannot be avoided, but to minimize their
number we can build $k$ different hash tables using randomly selected
distinct LSH functions from the same family that map to the same set of
bucket keys. In other words, we are increasing the number of estimators
for each bucket, which can boost the accuracy.


Algorithm 6.2: Finding similar documents
Input: Document d, dataset D
Input: LSH hash table T with documents grouped into buckets
Input: Similarity threshold θ
Output: Similar documents
S ← ∅
for key ∈ T do
    if d ∉ T(key) then
        continue
    for c ∈ T(key) do
        if Sim(d, c) > θ then
            S ← S ∪ {c}
return S


Since locality-sensitive hash functions focus on preserving similarities,
we can expect that hash functions will map similar documents to
the same bucket at least once. The resulting buckets can be built from
the documents that appear together at least once.

Of course, this technique increases the number of false positive errors, 
but they can be eliminated with high confidence, as we described above. 


Algorithm 6.3: Locality-Sensitive Hashing algorithm
Input: Document d
Input: LSH hash table T with documents grouped into buckets
Input: Similarity threshold θ
Output: Similar documents
S ← ∅
for i ← 1 to k do
    T_i ← Bucketing(D, H^θ_Sim)
T := T_1 ∪ T_2 ∪ ... ∪ T_k
for key ∈ T do
    if d ∉ T(key) then
        continue
    for c ∈ T(key) do
        if Sim(d, c) > θ then
            S ← S ∪ {c}
return S
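
The whole schema (several hash tables, candidate collection, exact filtering) can be sketched in Python as follows. Everything named here is an assumption made for illustration: documents are sets of feature indices, `make_lsh` produces stand-in LSH functions, the similarity is the Jaccard similarity, and the query document itself is dropped from the result, which is a design choice rather than part of the pseudocode.

```python
from collections import defaultdict

DOCS = {
    "d1": {8, 15, 16},
    "d2": {1, 5, 7, 10, 11, 15, 16, 17},
    "d3": {0, 1, 2, 3, 4, 5, 7, 9, 12, 17},
    "d4": {1, 5, 6, 13, 14, 15},
    "d5": {1, 2, 5, 9, 13, 15, 18},
}

def jaccard(a, b):
    return len(a & b) / len(a | b)

def make_lsh(a, b, p=31, m=19):
    # Hypothetical stand-in for one function drawn from the LSH family.
    return lambda doc: min(((a * x + b) % p) % m for x in doc)

def bucketing(docs, h):
    table = defaultdict(set)
    for name, doc in docs.items():
        table[h(doc)].add(name)
    return table

def find_similar(name, docs, functions, sim, theta):
    """Collect candidates from one table per function, then filter exactly."""
    candidates = set()
    for h in functions:
        table = bucketing(docs, h)
        candidates |= table[h(docs[name])]   # bucket that contains the query
    candidates.discard(name)                 # do not report the query itself
    return {c for c in candidates if sim(docs[name], docs[c]) > theta}

FUNCTIONS = [make_lsh(22, 5), make_lsh(30, 2), make_lsh(21, 23), make_lsh(15, 6)]
print(find_similar("d5", DOCS, FUNCTIONS, jaccard, theta=0.3))
```

In this run `d2` becomes a candidate for `d5` in some table, but the exact check removes it as a false positive.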

The performance of the LSH algorithm depends on a proper choice
of $\theta$ and $k$. Bad choices for these parameters could result in too few
documents in the hash buckets, leading to incorrect grouping, or too many
documents, leading to an increased time for exact similarity computation
at the final step.

The Locality-Sensitive Hashing algorithm is a framework to solve
the nearest neighbor problem, and it has different implementations based on
the chosen similarity measure. For instance, for the regular Euclidean
distance it can be implemented as Random Projections, for Jaccard
similarity as minwise hashing (MinHash), and for cosine similarity as
SimHash, which we study in detail in the next few sections.

Nearest neighbors search

When we need to find the nearest neighbors for a given document from
the dataset distributed into buckets by the LSH algorithm, we apply
the same locality-sensitive hash functions to that document and get
the relevant mapped buckets. Documents in those buckets
are the candidates for the nearest neighbors, and we compute the exact
similarity between them and the given document, filtering by
the comparison to the similarity threshold $\theta$.

In practical applications, there is a huge number of documents and
the problem of searching in an LSH hash table becomes challenging.
There are many approaches to handling it, but most of them introduce
additional measures that can help to store hash table keys in an optimized
order to improve the table lookup.

For instance, SortingKeys-LSH, invented by Yingfan Liu et al. in
2014 [Li14], improves the search by minimizing random I/O operations
when retrieving candidate documents. The authors defined a custom
distance measure for the hash table keys and proposed to sort those keys
in a special linear order associated with that distance. Following that
order, the candidate documents can be stored closely in the memory or
on the disk. When a new document arrives, we need to retrieve only
the documents for the close hashes according to the introduced distance
measure, and can find the candidates faster due to the reduction of
random I/O operations and higher search accuracy.


6.2 MinHash 

The most well-known implementation of the Locality-Sensitive Hashing
schema for Jaccard similarity is minwise hashing, or simply MinHash,
proposed by Andrei Broder in 1997 [Br97], which includes a
similarity-preserving hash function family and an algorithm for near-duplicates
detection. Initially used in the AltaVista search engine to detect duplicate
web pages [Br00], today it is widely adopted in the search industry with
numerous applications including large-scale machine learning systems,
content matching for online advertising, and others.

The idea is to represent documents as short fixed-length signatures
that preserve the similarity and can be compared efficiently.


MinHash signatures 

For every document $d_i$ in the form of a document bit-array, the MinHash
value is the position of the left-most set bit, in some permuted order of
the index (some order of the features). Thus, by each permutation $\pi$, we
can define a different MinHash value $\min(\pi(d_i))$.


Example 6.9: MinHash value 

Consider the document bit-array built for the document $d_3$ (flu) in
Example 6.6.

Index  0  1  2  3  4  5  6  7  8  9  10  11  12  13  14  15  16  17  18
Bit    1  1  1  1  1  1  0  1  0  1  0   0   1   0   0   0   0   1   0

For instance, let's take a random permutation of the index 0...18:

$\pi$ = {16, 13, 12, 4, 17, 10, 1, 2, 9, 14, 8, 5, 15, 3, 6, 18, 11, 7, 0},

that corresponds to the following order of features:


Index  Symptom            Index  Symptom
16     sneezing           8      itchiness
13     rash               5      fever
12     nausea             15     runny nose
4      feeling tired      3      difficulty sleeping
17     sore throat        6      greyish-white spots
10     loss of taste      18     swollen glands
1      cough              11     muscle aches
2      diarrhea           7      headache
9      loss of appetite   0      aching body
14     red eyes

































Thus, the document bit-array, indexed in the permuted order, is

Position  0  1  2  3  4  5  6  7  8  9  10  11  12  13  14  15  16  17  18
Bit       0  0  1  1  1  0  1  1  1  0  0   1   0   1   0   0   0   1   1

After the bits are re-ordered according to the permutation $\pi$, the position
of the left-most 1-bit for $d_3$ is 2, thus the MinHash value for the document
$d_3$ equals 2:

$$\min(\pi(d_3)) = 2.$$
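
The MinHash value under an explicit permutation can be sketched in Python. The function name `minhash_value` is illustrative; the bit-array and the permutation are the ones used in this example, and returning `None` for an array without set bits is an assumed convention.

```python
def minhash_value(bits, permutation):
    """Position of the left-most set bit when bits are read in permuted order."""
    for position, index in enumerate(permutation):
        if bits[index]:
            return position
    return None  # assumed convention: no set bits, no MinHash value

d3 = [1, 1, 1, 1, 1, 1, 0, 1, 0, 1, 0, 0, 1, 0, 0, 0, 0, 1, 0]
pi = [16, 13, 12, 4, 17, 10, 1, 2, 9, 14, 8, 5, 15, 3, 6, 18, 11, 7, 0]

permuted = [d3[index] for index in pi]   # the re-ordered bit-array
print(permuted)
print(minhash_value(d3, pi))
```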


Instead of relying on a single MinHash value, variability can be
reduced by building the MinHash signature of length $k$ for each
document $d_i$, which is a vector of $k$ MinHash values computed using $k$
random permutations $\pi_1, \pi_2, \ldots, \pi_k$ of the bit-array index. The
length of the signatures $k$ is independent of the size $m$ of the universal
set $\Omega$ and has to be chosen based on the allowed probability of error and
the given similarity threshold.

The signatures that have been built for each document $\{d_i\}_{i=1}^{n}$
form a signature matrix MinHashSig of size $k \times n$, which is the primary
data structure of the MinHash algorithm. The rows in the signature matrix
correspond to the permutations and the columns correspond to
the documents. It is important to highlight here that to build
the signature matrix we must use the same collection of permutations
and apply them in exactly the same order.

The signature matrix MinHashSig is a dense matrix with integer values,
and the number of columns is equal to the number of documents in
the dataset. However, the number of rows in the signature matrix is much
smaller than the number of features in the universal set $\Omega$, so this is
more storage-efficient than the binary representations of the documents.

Unfortunately, in practice, it is unfeasible to permute a large index
explicitly; even picking a random permutation of millions or billions
of integers is time-consuming, and the additional necessary sorting of
the index would take even more time.

For example, even for a very tiny webspam database of 350000 documents
and 16 million features [Li12], the preprocessing cost for 500 independent
random permutations was about 6000 seconds. However, nowadays it is
not rare to find a universal set with upwards of 1 billion features. To
pick a random permutation of 1 billion elements is not only slow, but
just the representation of the index using 32-bit integers requires 4 GB of
memory to store just one permutation.

Additionally, if the dataset does not fit into the main memory and we need
to store it on disk, accessing bits in a randomly permuted order will have
the same disk issues as those discussed in the context of Bloom filters.

However, we can simulate the effect of a random permutation by
a random hash function that maps indices $0 \ldots m - 1$ to exactly the same
range. Some collisions can occur, but they are not important as long as
$k$ is big enough. For instance, we can use the family of universal hash
functions $h_{\{a,b\}}(x)$, earlier defined by (1.2).


Example 6.10: Permutation simulation 

Consider again the same bit-array that we built for document $d_3$ (flu) in
Example 6.6.

Index  0  1  2  3  4  5  6  7  8  9  10  11  12  13  14  15  16  17  18
Bit    1  1  1  1  1  1  0  1  0  1  0   0   1   0   0   0   0   1   0

The universal set $\Omega$ for the documents has 19 features that are indexed
by integers in the range 0...18. To build signatures of length $k = 4$ for
the document $d_3$, we select four random hash functions from the family (1.2)
that map every index position $f \in 0 \ldots 18$ to position
$h_i(f) \in 0 \ldots 18$, making the permutation of the index. In our case
$m = 19$ and it is enough to choose $p = M_5 = 2^5 - 1 = 31$.

$h_1(x) := ((22 \cdot x + 5) \bmod 31) \bmod 19$,
$h_2(x) := ((30 \cdot x + 2) \bmod 31) \bmod 19$,
$h_3(x) := ((21 \cdot x + 23) \bmod 31) \bmod 19$,
$h_4(x) := ((15 \cdot x + 6) \bmod 31) \bmod 19$.

The corresponding permutations produced by the hash functions are

$h_1$ = {5, 8, 18, 9, 0, 3, 13, 4, 7, 17, 8, 11, 2, 12, 3, 6, 16, 7, 10},
$h_2$ = {2, 1, 0, 11, 10, 9, 8, 7, 6, 5, 4, 3, 2, 1, 0, 18, 17, 16, 15},
$h_3$ = {4, 13, 3, 5, 14, 4, 6, 15, 5, 7, 16, 6, 8, 17, 7, 9, 18, 8, 10},
$h_4$ = {6, 2, 5, 1, 4, 0, 3, 18, 2, 17, 1, 16, 0, 15, 11, 14, 10, 13, 9}.
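
The four value lists can be regenerated with a short Python sketch. The helper `make_hash` and the names `PARAMS` and `TABLES` are illustrative; the constants are the ones given in this example.

```python
def make_hash(a, b, p=31, m=19):
    """Hash function ((a * x + b) mod p) mod m with the constants of this example."""
    return lambda x: ((a * x + b) % p) % m

PARAMS = [(22, 5), (30, 2), (21, 23), (15, 6)]   # (a, b) for h1..h4
TABLES = [[make_hash(a, b)(x) for x in range(19)] for a, b in PARAMS]

for i, values in enumerate(TABLES, start=1):
    print(f"h{i} = {values}")
```

Note that the lists are not strict permutations: for instance, the first function maps both index 1 and index 10 to the value 8, which is the kind of collision the text tolerates for a sufficiently large signature length.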


Thus, instead of picking $k$ random permutations, we simply compute
$h_1, h_2, \ldots, h_k$ random hash functions on the rows and build the signature
matrix MinHashSig out of them. Note that we need only one pass
through the data to build the signature matrix in this way.


Algorithm 6.4: Building the MinHash signature matrix
Input: Binary document-vectors {d_j}, j = 1...n
Input: Family of universal hash functions {h_i}, i = 1...k
Input: Number m of unique features in the universal set
Output: MinHash signature matrix
MinHashSig ← ∞
for f ← 0 to m - 1 do
    for i ← 1 to k do
        h_i^f ← h_i(f)
    for d_j ∈ D do
        if d_j[f] ≠ 1 then
            continue
        for i ← 1 to k do
            MinHashSig[i - 1, d_j] ← min(MinHashSig[i - 1, d_j], h_i^f)
return MinHashSig
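
A possible Python rendering of Algorithm 6.4 is sketched below. The function and variable names are illustrative; the hash functions are assumed to be the four functions of Example 6.10 and the bit-arrays are the five document arrays listed in Example 6.11.

```python
import math

def make_hash(a, b, p=31, m=19):
    return lambda x: ((a * x + b) % p) % m

HASHES = [make_hash(22, 5), make_hash(30, 2), make_hash(21, 23), make_hash(15, 6)]

DOCS = [  # bit-arrays of d1..d5
    [0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0],
    [0, 1, 0, 0, 0, 1, 0, 1, 0, 0, 1, 1, 0, 0, 0, 1, 1, 1, 0],
    [1, 1, 1, 1, 1, 1, 0, 1, 0, 1, 0, 0, 1, 0, 0, 0, 0, 1, 0],
    [0, 1, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0],
    [0, 1, 1, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 1, 0, 1, 0, 0, 1],
]

def build_signatures(docs, hash_funcs, m):
    """One pass over the feature index; rows are hash functions, columns documents."""
    sig = [[math.inf] * len(docs) for _ in hash_funcs]
    for f in range(m):
        values = [h(f) for h in hash_funcs]      # hash the index once per function
        for j, doc in enumerate(docs):
            if doc[f] != 1:
                continue                         # feature f is absent in this document
            for i, value in enumerate(values):
                sig[i][j] = min(sig[i][j], value)
    return sig

for row in build_signatures(DOCS, HASHES, m=19):
    print(row)
```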


Suppose we have a million documents and use signatures of length 200.
Using 32-bit integers to represent the values, we need 800 bytes per
document, so the entire dataset requires about 800 MB of memory storage.

Example 6.11: MinHash signature matrix

Let's build a MinHash signature matrix with signatures of length $k = 4$
using the permutations built in Example 6.10. Within the initial feature
order, the bit-arrays for all documents are below.

        0  1  2  3  4  5  6  7  8  9  10  11  12  13  14  15  16  17  18
$d_1$   0  0  0  0  0  0  0  0  1  0  0   0   0   0   0   1   1   0   0
$d_2$   0  1  0  0  0  1  0  1  0  0  1   1   0   0   0   1   1   1   0
$d_3$   1  1  1  1  1  1  0  1  0  1  0   0   1   0   0   0   0   1   0
$d_4$   0  1  0  0  0  1  1  0  0  0  0   0   0   1   1   1   0   0   0
$d_5$   0  1  1  0  0  1  0  0  0  1  0   0   0   1   0   1   0   0   1

In the beginning, all values in the signature matrix are not set; effectively,
we can fill them with $\infty$:

        $d_1$  $d_2$  $d_3$  $d_4$  $d_5$
$h_1$   ∞      ∞      ∞      ∞      ∞
$h_2$   ∞      ∞      ∞      ∞      ∞
$h_3$   ∞      ∞      ∞      ∞      ∞
$h_4$   ∞      ∞      ∞      ∞      ∞



The values of hash functions for the first index value, 0, are
$h_1^0 = h_1(0) = 5$, $h_2^0 = h_2(0) = 2$, $h_3^0 = h_3(0) = 4$, and
$h_4^0 = h_4(0) = 6$. In this first position, only document $d_3$ has
a 1-bit, thus we can update its signature values for each row, and the new
values would be the minimum between the existing values of column $d_3$ in
the signature matrix and the values of the corresponding hash functions.
For instance,

$$MinHashSig[h_1, d_3] = \min(MinHashSig[h_1, d_3], h_1^0) = \min(\infty, 5) = 5.$$

Thus, the signature matrix MinHashSig after the first index position has been
processed is

        $d_1$  $d_2$  $d_3$  $d_4$  $d_5$
$h_1$   ∞      ∞      5      ∞      ∞
$h_2$   ∞      ∞      2      ∞      ∞
$h_3$   ∞      ∞      4      ∞      ∞
$h_4$   ∞      ∞      6      ∞      ∞

For the next index value, 1, the hash values are $h_1^1 = h_1(1) = 8$,
$h_2^1 = h_2(1) = 1$, $h_3^1 = h_3(1) = 13$, and $h_4^1 = h_4(1) = 2$. In this
case, all columns, except $d_1$, can be updated since all those documents have
the second bit set. The columns $d_2$, $d_4$, and $d_5$ simply get
the corresponding hash values because there were no prior values for them
($\infty$ in the signature matrix). However, for $d_3$ we need to compare
the existing values with the current values of the hash functions to choose
the smallest for each row, for instance,

$$MinHashSig[h_2, d_3] = \min(MinHashSig[h_2, d_3], h_2^1) = \min(2, 1) = 1.$$

The matrix has the following form:

        $d_1$  $d_2$  $d_3$  $d_4$  $d_5$
$h_1$   ∞      8      5      8      8
$h_2$   ∞      1      1      1      1
$h_3$   ∞      13     4      13     13
$h_4$   ∞      2      2      2      2



Skipping ahead, this is the signature matrix after processing the index
values up to 14.

        $d_1$  $d_2$  $d_3$  $d_4$  $d_5$
$h_1$   7      3      0      3      3
$h_2$   6      1      0      0      0
$h_3$   5      4      3      4      3
$h_4$   2      0      0      0      0

Next, we continue on to process index value 15. At this position all
documents, except $d_3$, have corresponding bits set. The values of the hash
functions are $h_1^{15} = h_1(15) = 6$, $h_2^{15} = h_2(15) = 18$,
$h_3^{15} = h_3(15) = 9$, and $h_4^{15} = h_4(15) = 14$. For instance,

$$MinHashSig[h_1, d_1] = \min(MinHashSig[h_1, d_1], h_1^{15}) = \min(7, 6) = 6,$$

which means we need to change the corresponding value in the signature
matrix.

        $d_1$  $d_2$  $d_3$  $d_4$  $d_5$
$h_1$   6      3      0      3      3
$h_2$   6      1      0      0      0
$h_3$   5      4      3      4      3
$h_4$   2      0      0      0      0

If we process further index values, we can see that no actual updates are
possible, meaning that the signature matrix above is the final one.

In fact, every permutation defines a MinHash function that is applied
to the documents. It was proven that a family of such functions is an LSH
family and the probability of collision over all permutations is equal to
the Jaccard similarity:

$$\Pr(\min(\pi(d_A)) = \min(\pi(d_B))) = J(d_A, d_B). \qquad (6.4)$$

Thus, to estimate the Jaccard similarity between two documents, it is
enough to compute the fraction of the MinHash signatures for which two
corresponding columns have the same value (collide) in the signature
matrix MinHashSig. While we are looking for hash collisions, it is
possible that no identical values are found in any row; then we can
assume that the documents are dissimilar.


Example 6.12: Similarity between signatures 

Consider the signature matrix MinHashSig that we built in Example 6.11.

        $d_1$  $d_2$  $d_3$  $d_4$  $d_5$
$h_1$   6      3      0      3      3
$h_2$   6      1      0      0      0
$h_3$   5      4      3      4      3
$h_4$   2      0      0      0      0

For instance, columns $d_2$ and $d_3$ share one value out of four signatures and
the similarity between them is

$$Sim_{MinHashSig}(d_2, d_3) = \frac{1}{4} = 0.25.$$

From Example 6.5 we know that the exact similarity is 0.2857, which is
pretty close.

Columns $d_4$ and $d_5$ have three out of four values in common, therefore
the similarity is

$$Sim_{MinHashSig}(d_4, d_5) = \frac{3}{4} = 0.75.$$

This notably exceeds the exact Jaccard similarity value 0.44, but still
indicates the high similarity between documents.

In contrast, columns $d_1$ and $d_3$ share no common values, so the similarity
is 0, which is the exact value as well.

Remember that the value we compute from the signature matrix is
an approximation for the true value of the Jaccard similarity and depends
on the signature length. The current length $k = 4$ is used for demonstration
purposes only, and, in fact, is too small to build a close estimation with
low variance according to the law of large numbers.
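
The estimates of this example can be reproduced with a few lines of Python. The names `SIG` and `estimate_similarity` are illustrative; the matrix is the final signature matrix of Example 6.11, with columns indexed from 0 (so column 0 is $d_1$).

```python
SIG = [
    [6, 3, 0, 3, 3],   # h1
    [6, 1, 0, 0, 0],   # h2
    [5, 4, 3, 4, 3],   # h3
    [2, 0, 0, 0, 0],   # h4
]

def estimate_similarity(sig, a, b):
    """Fraction of signature rows in which columns a and b hold the same value."""
    matches = sum(1 for row in sig if row[a] == row[b])
    return matches / len(sig)

print(estimate_similarity(SIG, 1, 2))   # d2 vs d3
print(estimate_similarity(SIG, 3, 4))   # d4 vs d5
print(estimate_similarity(SIG, 0, 2))   # d1 vs d3
```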


Properties 


There is a clear trade-off between the similarity estimation error and
storage. Indeed, the more MinHash functions $h_i$ we use, the longer
signatures we build, and correspondingly the lower the expected error in
the similarity estimation. However, this increases the storage requirements
for the signature matrix MinHashSig and the number of required
permutations, which can significantly increase the computational effort.


The practical guideline on choosing the signature length k based on 
the expected Standard error 0 is 


k 


8 


To store the MinHash signature of a single document using $p$-bit
MinHash values, we need $p \cdot k$ bits per signature (for instance, $p = 32$
allows enumerating up to $2^{32} - 1$ features) and the memory requirement
for the whole MinHash signature matrix MinHashSig is $p \cdot k \cdot n$ bits.

When the number of documents $n$ is high, storage becomes a problem
for the algorithm. As a work-around to this problem, Ping Li and Arnd
Christian König in 2010 [Li10] proposed a simple modification of minwise
hashing, called b-bit minwise hashing. It provides a simple solution by
storing only the lowest $b$ bits of each $p$-bit MinHash value, naturally
reducing the required memory for the signature matrix.

Intuitively, using fewer bits per MinHash value increases the similarity
estimation variance, compared with the original minwise hashing for
the same signature length $k$. Thus, it is necessary to increase $k$ to
maintain the same accuracy. The theoretical results [Li11] demonstrate
that the signature length $k$ should be adjusted only by a factor of about
. For most popular cases, when the resemblance is not too small (e.g., 
$\theta > 0.5$ as a common threshold), this is just two to three times bigger.

If the number of documents is large, which is the reason for using such
improvements, the theoretical results suggest using $b = 1$ if the similarity
threshold is $\theta > 0.4$, and $b \geq 2$ otherwise. Thus, even with the increased
length of signatures, the total signature matrix size becomes smaller with
b-bit minwise hashing.


Example 6.13: b-bit minwise hashing 

As an example, for the similarity threshold $\theta = 0.5$, we can use $b = 1$, so
the estimation variance will increase at most by a factor of three and, in
order not to lose accuracy, it is necessary to adjust the signature length
respectively. If each MinHash value has been stored initially using 32 bits,
the improvement by using one-bit values is $\frac{32}{3} \approx 10.67$.

More specifically, replacing the classical MinHash algorithm that uses
32-bit MinHash values and signatures of length $k = 200$ by the 1-bit
minwise hashing, we need longer signatures of length $k = 3 \cdot 200 = 600$,
but greatly decrease the memory requirements from 800 bytes to 75 bytes
per document.
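
The storage arithmetic of this example can be verified with a tiny Python sketch. The helper name `signature_bytes` is illustrative, and the factor of three for the longer signatures is the one stated in the example for $\theta = 0.5$ and $b = 1$.

```python
def signature_bytes(k, bits_per_value):
    """Bytes needed for one signature of k values stored with the given bit width."""
    return k * bits_per_value / 8

classic = signature_bytes(k=200, bits_per_value=32)      # classical MinHash
one_bit = signature_bytes(k=3 * 200, bits_per_value=1)   # 1-bit minwise hashing

print(classic, one_bit, round(classic / one_bit, 2))
```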


Perhaps the most important advantages of b-bit minwise hashing are its
simplicity and the minimal modifications to the original minwise hashing
algorithm; therefore, it could be used to optimize already running systems.


Nearest neighbors search 

While MinHash signatures let us represent documents in a compressed 
form using a space-efhcient MinHashSig data structure that preserves 
the similarity information, there is stili a quadratic number of pairs, 
which, as we already estimated, are unfeasible to process quickly for huge 
datasets of millions of documents. For instance, if a dataset consists of 
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one million documents, the number of pairs is 5 • 10 11 and, doing as high 
as 10' comparisons per second, it requires about 14 hours to finish. 

According to the generic LSH schema, to hnd the nearest neighbors 
for a given document, we need to take a number of independent locality- 
sensitive hash functions and apply them to the dataset in order to 
compute a key for each document, that is used to group them. 

However, if documents are already represented as a MinHash signature
matrix, it is enough to split all rows into b bands, select only one
conventional hash function g (e.g., MurmurHash3), and apply it to
the portion of each column within the band. Every band corresponds
to a subset of features and we hash the documents looking only at that
subset. Thus, two documents end up in the same bucket (have equal hash
values) only if they are identical in that band or when a collision happens,
which is rare for conventional hash functions and will be eliminated at
the last step. In other words, two documents will appear in the same
bucket if there is at least one band where their signature values are
identical.

By choosing the number of bands b appropriately, we eliminate many
document pairs with similarities below the threshold $\theta$. Intuitively,
the more similar the signatures are, the more likely they will agree on all
rows in some band and become a candidate pair.

Example 6.14: MinHash LSH schema 

Let’s divide the signature matrix built in Example 6.11 into $b = 2$ bands
with two rows each:

                 d1   d2   d3   d4   d5
Band 1    h1      6    3    0    3    3
          h2      6    1    0    0    0
Band 2    h3      5    4    3    4    3
          h4      2    0    0    0    0


The last two columns in band 1 are identical, so regardless of the particular
hash function, the documents $d_4$ and $d_5$ become a candidate pair. Similarly,
documents $d_2$ and $d_4$, and $d_3$ and $d_5$, which have identical values in
band 2, also become candidate pairs.

Considering the similarity threshold $\theta = 0.3$, we eliminate false positive
candidate pairs by computing the exact similarity between the documents in
each pair and comparing it with the threshold:

$$J(d_4, d_5) = 0.44 > \theta,$$

$$J(d_2, d_4) = 0.27 < \theta,$$

$$J(d_3, d_5) = 0.307 > \theta.$$

Only the pairs $d_4$ (measles) and $d_5$ (roseola), and $d_3$ (flu) and $d_5$
(roseola), can be returned as near-duplicates for the given threshold, which
was our expectation as well.

Note that we chose the similarity threshold for this example quite
arbitrarily; however, there is a relation between the number of bands,
the length of the signature, and the threshold.
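
The banding step from Example 6.14 can be sketched in Python as follows. The function name `candidate_pairs` is hypothetical, and instead of an explicit hash function g we use the tuple of band values as a dictionary key, so Python's built-in hashing stands in for MurmurHash3; this is an assumption made only to keep the sketch short.

```python
from collections import defaultdict
from itertools import combinations

def candidate_pairs(signatures, bands):
    """Pairs of documents whose signatures agree on all rows of at least one band."""
    k = len(next(iter(signatures.values())))
    rows = k // bands  # rows per band; assumes k is divisible by the number of bands
    pairs = set()
    for band in range(bands):
        buckets = defaultdict(list)
        for doc, sig in signatures.items():
            # The band's values act as the bucket key (the dict hashes the tuple).
            key = tuple(sig[band * rows:(band + 1) * rows])
            buckets[key].append(doc)
        for docs in buckets.values():
            pairs.update(combinations(sorted(docs), 2))
    return pairs

# Columns of the signature matrix from Example 6.14 (rows h1..h4).
signatures = {
    "d1": [6, 6, 5, 2],
    "d2": [3, 1, 4, 0],
    "d3": [0, 0, 3, 0],
    "d4": [3, 0, 4, 0],
    "d5": [3, 0, 3, 0],
}

print(sorted(candidate_pairs(signatures, bands=2)))
# [('d2', 'd4'), ('d3', 'd5'), ('d4', 'd5')]
```

The candidates still have to be verified against the exact similarity, since agreeing in one band does not guarantee that the documents are above the threshold.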


To successfully apply a banding strategy, we need
a recommendation for the number of bands, which depends on
the similarity threshold $\theta$ we want to use to distinguish similar
documents. Intuitively, if we have too many bands, the bands are short and
it is more likely that many documents will agree in at least one of them and
become candidate pairs (an increased number of false positive errors).
With too few bands, we need to compare long subsequences of signatures
that are likely to differ in a few values even for documents that are very
alike, so we can miss many similar documents.

As soon as all candidate pairs are built, we execute the last step of 
the LSH schema and compute the exact similarity between the documents 
to eliminate false positive results. 


The LSH approach is very sensitive to the similarity distribution between 
documents in the dataset. If the dataset is skewed and most documents are 
similar to each other, we may find that all documents fall in one bucket, 
while other buckets remain empty.


Suppose that a particular pair of documents has similarity s; then
the probability P that the signatures agree in all rows of at least one
band is

$$P = 1 - \left(1 - s^{k/b}\right)^{b}, \tag{6.5}$$

where b is the number of bands and k is the length of the MinHash
signatures, so $k/b$ is the number of rows in each band.

The graph of the probability that documents with similarity s become 
candidate pairs according to (6.5) is an S-curve, meaning its values are 
very low until it reaches a step, then its values quickly increase and stay 
very high. 



[Figure: probability P of becoming a candidate pair as a function of the similarity s (S-curve)]


According to formula (6.3), we want to find the parameters for which that
step occurs close to the threshold $\theta$, giving a condition on b and k:

$$\theta \approx \left(\frac{1}{b}\right)^{b/k}.$$

For example, the graph above is built for signatures of length $k = 50$
with $b = 10$ bands of five rows each. The approximate step value is 0.63,
which is the similarity threshold at which documents are considered
similar.
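
To get a feeling for the S-curve, we can evaluate the probability (6.5) for a few similarity values. The sketch below assumes that the step is located approximately at $(1/b)^{b/k}$, which reproduces the value 0.63 quoted above; the function names are introduced here for illustration only.

```python
def candidate_probability(s: float, k: int, b: int) -> float:
    """Probability (6.5) that two signatures agree on all rows of at least one band."""
    rows = k / b  # rows per band
    return 1.0 - (1.0 - s ** rows) ** b

def approximate_threshold(k: int, b: int) -> float:
    """Approximate similarity at which the S-curve has its step: (1/b)^(b/k)."""
    return (1.0 / b) ** (b / k)

k, b = 50, 10
print(round(approximate_threshold(k, b), 2))  # 0.63
for s in (0.2, 0.4, 0.6, 0.8):
    print(s, round(candidate_probability(s, k, b), 3))
```

For similarities well below the step the probability stays close to zero, and well above it the probability is close to one.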


Generally, for the given signature length k and similarity threshold $\theta$,
the number of bands can be estimated as

$$b \approx e^{W(-k \cdot \ln\theta)}, \tag{6.6}$$

where $W(\cdot)$ is the Lambert W function, which cannot be expressed in
terms of elementary functions, but can be approximately computed by
an iterative process 4:

$$W_{next} = \frac{1}{W_{prev} + 1}\left(W_{prev}^{2} - k \cdot \ln\theta \cdot e^{-W_{prev}}\right).$$


Example 6.15: Similarity threshold estimation 
In the previous example, we used $b = 2$ bands with signatures of length
$k = 4$. This setup corresponds to the threshold $\theta = 0.707$, meaning that
documents with a similarity of at least 70% are likely to become a candidate
pair after applying the bucketing from Example 6.14:

$$\theta \approx \left(\frac{1}{2}\right)^{2/4} = 0.707.$$


On the other hand, if we want to estimate the required number of bands
for $k = 4$ and the similarity threshold of 0.707 using (6.6), we need to
compute the Lambert function $W(-4 \cdot \ln(0.707))$:

Iteration    W         b
1            1.3868    4
2            0.9510    2
3            0.6948    2
4            0.6933    2
5            0.6933    2

As you can see, the iterative process converges quite quickly and
the recommended number of bands is 2.
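
The estimation can be sketched in Python as follows. The iteration is Newton's method for the equation $W e^{W} = -k \cdot \ln\theta$, started from $W = 0$. The stopping rule and the rounding of the result to the nearest integer are assumptions of this sketch, so the intermediate iterates are not guaranteed to line up one-to-one with the rows of the table.

```python
import math

def lambert_w(x: float, tol: float = 1e-10, max_iter: int = 100) -> float:
    """Approximate W(x) for x >= 0 by Newton's iteration, starting at W = 0."""
    w = 0.0
    for _ in range(max_iter):
        w_next = (w * w + x * math.exp(-w)) / (w + 1.0)
        if abs(w_next - w) < tol:
            return w_next
        w = w_next
    return w

def estimate_bands(k: int, theta: float) -> int:
    """Number of bands b = exp(W(-k * ln(theta))), rounded to the nearest integer."""
    return max(1, round(math.exp(lambert_w(-k * math.log(theta)))))

print(round(lambert_w(-4 * math.log(0.707)), 4))  # 0.6933
print(estimate_bands(4, 0.707))                   # 2
print(estimate_bands(50, 0.63))                   # 10
```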


However, since we are using very short signatures of length $k = 4$,
the standard error is $\delta = 0.11$ according to formula (6.2), so the
similarity of true candidates can be estimated much lower than its true level,
and hence they can end up in different buckets. If we want to be more precise
and maintain a standard error $\delta$ of about 0.05 with the similarity
threshold $\theta = 0.7$, we need to use signatures of length

$$k = \left\lceil \frac{\sqrt{0.7 \cdot 0.3}}{0.05} \right\rceil = 10.$$


4 As an initial value we can use $W_{prev} = 0$, meaning only one band.


The MinHash algorithm is very efficient for huge datasets and can
be easily adapted to the MapReduce computation model, which makes
it popular in Big Data applications. Various implementations are
available in Apache Spark, Apache Mahout, and Apache Lucene, and it is
used in search engines and databases such as Elasticsearch, Apache Solr,
CrateDB, and others. Google was reportedly using it for Google News
personalization.


6.3 SimHash 

Another popular hashing algorithm is SimHash, a sign normal random
projection algorithm, which is based on the simhash function developed by
Moses S. Charikar in 2002 [Ch02] and applied by Gurmeet Singh Manku,
Arvind Jain, and Anish Das Sarma in 2007 [Ma07] to solve the problem
of detecting near-duplicate web pages at Google.

From the mathematical point of view, SimHash uses the concept
of sign random projections. For a k-dimensional real-valued
document-vector d, it defines a similarity-preserving SimHash function family
$\{h_v^{sim}\}$ that, for a random vector v with i.i.d. components drawn from
the standard normal distribution (i.e., $v_i \sim N(0, 1)$), produces the value

$$h_v^{sim}(d) := \mathrm{sign}(v \cdot d) = \begin{cases} 1, & v \cdot d \geq 0, \\ 0, & v \cdot d < 0. \end{cases} \tag{6.7}$$

Thus, the SimHash value is the sign of the random projection and since 
the hyperplane with a normal vector v separates the multidimensional 
space into two half-spaces, it encodes just the information on the side 
(positive or negative) where the document is located. 

Example 6.16: SimHash value 

Consider the document-vector from Example 6.7 built for $d_3$ (flu):

16  2  9  16  16  2  0  9  0  9  0  0  16  0  0  0  0  9  0

To compute its SimHash value, we need to build a vector v of 19 components
which defines a hyperplane that separates the 19-dimensional space of
documents. To build such a vector, we generate 19 random values from
the normal distribution N(0,1) and use them as components of the vector
(since scaling is not important in our case, we scale the values by 10):

5  -1  6  15  -2  -2  16  8  -5  5  -5  -5  2  -19  -17  -6  -10  3  -9

The dot product of these two vectors is the sum of the pairwise products of
the corresponding components of both vectors:

$$v \cdot d = 5.12.$$

Thus, the sign of the result is positive, and the SimHash value is

$$h_v^{sim}(d) = \mathrm{sign}(v \cdot d) = 1.$$
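
The computation from Example 6.16 can be reproduced with a short Python sketch. We use the integer values exactly as they are printed in the example; with them the dot product is 512, which we assume differs from the value 5.12 only by the scaling of the printed numbers, and scaling does not affect the sign. The function name `simhash_value` is introduced here for illustration.

```python
def simhash_value(v, d):
    """SimHash value (6.7): 1 if the projection v . d is non-negative, otherwise 0."""
    dot = sum(vi * di for vi, di in zip(v, d))
    return 1 if dot >= 0 else 0

# Document-vector for d3 (flu) and the random vector v, as printed in Example 6.16.
d3 = [16, 2, 9, 16, 16, 2, 0, 9, 0, 9, 0, 0, 16, 0, 0, 0, 0, 9, 0]
v = [5, -1, 6, 15, -2, -2, 16, 8, -5, 5, -5, -5, 2, -19, -17, -6, -10, 3, -9]

print(sum(vi * di for vi, di in zip(v, d3)))  # 512
print(simhash_value(v, d3))                   # 1
```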


Note that if two documents have an angle $\alpha = \pi$ between them, they will
certainly appear in different half-spaces and, conversely, documents
with a perfect alignment that have $\alpha = 0$ definitely lie in the same
half-space. Since the magnitude of the document-vectors doesn’t play
any role in formula (6.7), the probability that two documents $d_A$ and $d_B$
have the same SimHash value is equal to the probability of appearing on
the same side of the hyperplane, which can be formulated using the angle
between the documents $\alpha = \alpha(d_A, d_B)$ as

$$\Pr\left(h_v^{sim}(d_A) = h_v^{sim}(d_B)\right) = 1 - \frac{\alpha}{\pi}, \tag{6.8}$$

which defines the probability of hash collision for the SimHash function.
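
The collision probability $1 - \alpha/\pi$ can be checked empirically. The following sketch draws many random normal vectors and counts how often two fixed documents receive the same SimHash value; the two-dimensional documents, the number of trials, and the seed are arbitrary choices made for illustration.

```python
import math
import random

def simhash_value(v, d):
    """1 if the projection v . d is non-negative, otherwise 0."""
    return 1 if sum(vi * di for vi, di in zip(v, d)) >= 0 else 0

def collision_probability(d_a, d_b):
    """Theoretical collision probability: 1 - alpha / pi."""
    dot = sum(x * y for x, y in zip(d_a, d_b))
    norm_a = math.sqrt(sum(x * x for x in d_a))
    norm_b = math.sqrt(sum(y * y for y in d_b))
    cos = max(-1.0, min(1.0, dot / (norm_a * norm_b)))  # guard against rounding
    return 1.0 - math.acos(cos) / math.pi

def empirical_collision_rate(d_a, d_b, trials=20000, seed=42):
    """Share of random hyperplanes that put both documents on the same side."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        v = [rng.gauss(0.0, 1.0) for _ in d_a]
        hits += simhash_value(v, d_a) == simhash_value(v, d_b)
    return hits / trials

d_a, d_b = [1.0, 0.0], [1.0, 1.0]  # the angle between them is pi / 4
print(collision_probability(d_a, d_b))     # close to 0.75
print(empirical_collision_rate(d_a, d_b))  # should be near the value above
```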

Such collision probability is closely related to the function $\cos(\alpha)$;
therefore, if documents are close to each other in terms of the cosine
similarity (6.2), they will almost certainly collide, and vice versa. In this
sense, the family of SimHash functions preserves the cosine similarity between
documents and is a locality-sensitive function family for the cosine
similarity.


SimHash signatures

The variability of a single SimHash function, which produces just a single
bit, is very high; to reduce it, we can use p hash functions with different
random vectors to produce a p-bit vector that is called a SimHash
signature. Since every hash function preserves the similarity between
documents, to estimate the similarity between signatures, we need to
count the number of corresponding bits in them that have the same value.

Thus, instead of working directly with long and real-valued 
document-vectors, the SimHash algorithm maintains a SimHashTable 
data structure that for every document stores its short fixed-length 
binary SimHash signature, which is conceptually very close to 
the signatures that we built in the MinHash algorithm. 

Every document in the SimHashTable is represented as a p-bit binary
string, which requires significantly less storage than high-dimensional
real-valued document-vectors; therefore, it is a storage-efficient
representation of the dataset.

Consider documents represented by real-valued vectors of the weights
$(w_0, w_1, \ldots, w_{k-1})$ of the document features $(s_0, s_1, \ldots, s_{k-1})$ or,
practically speaking, we can think about documents as vectors of tuples
$\{(s_j, w_j)\}_{j=0}^{k-1}$.

To build a p-bit SimHash signature for a document d, first, we hash
each feature $s_j$ using any conventional hash function h (e.g.,
MurmurHash3, SHA-1) into a p-bit hash value $h_j = h(s_j)$ that is going
to be unique to the particular feature. After that, we start with
an intermediate p-dimensional zero vector v and, iterating over the hash
values for all the features, we increase the i-th component $v_i$ by
the weight $w_j$ if the i-th bit of the hash value $h_j$ is one, and decrease
it otherwise. At the end, when all features have been processed, we
determine the signs of the components of the vector v and set
the corresponding bits of the final p-bit SimHash signature f to one for
non-negative, and to zero for negative components.

Algorithm 6.5: Building the SimHash signature table
Input: Document-vectors d = {(s_j, w_j)}, j = 0, ..., k - 1
Input: Conventional hash function h
Output: SimHash signature table
v := {v_i}, i = 0, ..., p - 1, v_i ← 0
for j ← 0 to k - 1 do
    h_j ← binary(h(s_j))
    for i ← 0 to p - 1 do
        /* h_j[i] ∈ {0, 1}, we either increment or decrement v_i */
        v_i ← v_i + (2 · h_j[i] - 1) · w_j
return sign(v)


Example 6.17: SimHash signature table 

Consider the dataset from Example 6.7. For simplicity, we build 6-bit
SimHash signatures and, to compute hashes of all the features of
the universal set, we use a randomly chosen 32-bit hash function
MurmurHash3:

$$h(x) := \mathrm{MurmurHash3}(x) \bmod 2^6.$$

Thus, the hashes of the features are


Feature   Symptom               h(s)   binary(h(s))
s0        aching body           56     000111
s1        cough                 9      100100
s2        diarrhea              14     011100
s3        difficulty sleeping   41     100101
s4        feeling tired         17     100010
s5        fever                 43     110101
s6        greyish-white spots   7      111000
s7        headache              5      101000
s8        itchiness             26     010110
s9        loss of appetite      37     101001
s10       loss of taste         24     000110
s11       muscle aches          13     101100
s12       nausea                6      011000
s13       rash                  38     011001
s14       red eyes              62     011111
s15       runny nose            18     010010
s16       sneezing              27     110110
s17       sore throat           46     011101
s18       swollen glands        4      001000

The binary representations are written starting from the least significant
bit, so the leftmost digit corresponds to position 0.


Similar to that example, we can build a real-valued representation of all
documents in the dataset using feature weights:

d1    0   0   0   0   0   0   0   0  54   0   0   0   0   0   0   7  31   0   0
d2    0   3   0   0   0   3   0  11   0   0  20  20   0   0   0   3  11  11   0
d3   16   2   9  16  16   2   0   9   0   9   0   0  16   0   0   0   0   9   0
d4    0   4   0   0   0   4  27   0   0   0   0   0   0  15  27   4   0   0   0
d5    0   3  13   0   0   3   0   0   0  13   0   0   0  13   0   3   0   0  23



Let’s build a signature for document $d_3$ (flu). We iterate over all
features, using their binary representations, and for each feature, we update
the intermediate vector based on the document weights.

The intermediate vector v is a vector of 6 components that are all equal
to zero at the beginning:

0  0  0  0  0  0

We start computing the components of the vector v by iterating over all
features. For instance, the binary representation of the feature $s_0$ has zeros
in positions 0, 1, and 2, thus we decrease the corresponding components
of vector v by the feature weight $w_0 = 16$, which can be found in the first
column for the document $d_3$ in the table above. For positions 3, 4, and
5, where the feature’s hash value has ones, we add the feature weight
instead:

-16  -16  -16  16  16  16

In the same way, we process the feature $s_1$, which has ones in positions
0 and 3, with the corresponding weight $w_1 = 2$:

-14  -18  -18  18  14  14

Continuing with the other features, we get the final form of the vector v:

4  -32  0  4  -40  0

The actual values in the vector v are not critical, and to build a signature
for the document we need only the signs of the components. If a component
of vector v is non-negative, the corresponding component of the signature
is set to one; otherwise, it is set to zero. For document $d_3$ we have negative
values in positions 1 and 4 only, thus the signature $f_3$ is

Position   0  1  2  3  4  5
f3         1  0  1  1  0  1



Following the same procedure, we process all remaining documents and
the final SimHashTable is

      0  1  2  3  4  5
d1    0  1  0  1  1  0
d2    1  0  1  1  0  0
d3    1  0  1  1  0  1
d4    0  1  1  0  0  1
d5    0  0  1  0  0  0
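
The whole construction from Example 6.17 can be sketched in Python. The feature hashes and weights are copied from the tables of the example; the function name `simhash_signature` is hypothetical. Bits are taken starting from the least significant one, and a zero component of v is mapped to one, following the example.

```python
def simhash_signature(weights, hashes, p):
    """Build a p-bit SimHash signature from feature weights and feature hashes."""
    v = [0] * p
    for w, h in zip(weights, hashes):
        if w == 0:
            continue  # features with zero weight do not change v
        for i in range(p):
            bit = (h >> i) & 1         # i-th bit, counting from the least significant
            v[i] += (2 * bit - 1) * w  # add the weight for a one, subtract it for a zero
    return [1 if x >= 0 else 0 for x in v]

# h(s) for the features s0..s18 from Example 6.17.
hashes = [56, 9, 14, 41, 17, 43, 7, 5, 26, 37, 24, 13, 6, 38, 62, 18, 27, 46, 4]

documents = {
    "d1": [0, 0, 0, 0, 0, 0, 0, 0, 54, 0, 0, 0, 0, 0, 0, 7, 31, 0, 0],
    "d2": [0, 3, 0, 0, 0, 3, 0, 11, 0, 0, 20, 20, 0, 0, 0, 3, 11, 11, 0],
    "d3": [16, 2, 9, 16, 16, 2, 0, 9, 0, 9, 0, 0, 16, 0, 0, 0, 0, 9, 0],
    "d4": [0, 4, 0, 0, 0, 4, 27, 0, 0, 0, 0, 0, 0, 15, 27, 4, 0, 0, 0],
    "d5": [0, 3, 13, 0, 0, 3, 0, 0, 0, 13, 0, 0, 0, 13, 0, 3, 0, 0, 23],
}

table = {name: simhash_signature(w, hashes, p=6) for name, w in documents.items()}
for name, signature in table.items():
    print(name, signature)
# d1 [0, 1, 0, 1, 1, 0]
# d2 [1, 0, 1, 1, 0, 0]
# d3 [1, 0, 1, 1, 0, 1]
# d4 [0, 1, 1, 0, 0, 1]
# d5 [0, 0, 1, 0, 0, 0]
```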


The probability that two signatures collide on some bit is equal to
the collision probability given by formula (6.8). Therefore, two documents
are considered similar if their signatures differ in at most $\eta$ bit-positions
or, in other words, if the Hamming distance between their signatures is
at most $\eta$, where $\eta$ is a design parameter that is closely related to
the similarity threshold $\theta$.


The Hamming distance is widely used in information theory and can be
seen as a measure of the minimum number of errors that could transform
one signature into another. For binary strings, the Hamming distance is
equal to the number of ones after applying the bitwise XOR operation.


Example 6.18: Hamming distance between signatures

Consider the signatures that we built in Example 6.17 and compare
the documents $d_4$ (measles) and $d_5$ (roseola) using the Hamming distance:

      0  1  2  3  4  5
d4    0  1  1  0  0  1
d5    0  0  1  0  0  0

The corresponding bits in these two signatures differ in positions 1 and 5.
Thus, the Hamming distance between them is equal to 2, meaning these
documents are quite similar to each other, which is not surprising. For
comparison, the exact cosine similarity between these documents is

$$c(d_4, d_5) = \cos(\alpha) = 0.17,$$

while for the current dataset the similarity threshold $\theta = 0.15$ can be
considered reasonable.
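
When signatures are stored as integers, the Hamming distance reduces to an XOR followed by counting the ones. The sketch below uses the two signatures from Example 6.18; the helper `to_int`, which packs a list of bits into an integer, is introduced here only for illustration.

```python
def hamming_distance(a: int, b: int) -> int:
    """Number of differing bits: count the ones in the bitwise XOR."""
    return bin(a ^ b).count("1")

def to_int(bits):
    """Pack a list of bits (position 0 first) into an integer."""
    return sum(bit << i for i, bit in enumerate(bits))

f4 = to_int([0, 1, 1, 0, 0, 1])  # d4 (measles)
f5 = to_int([0, 0, 1, 0, 0, 0])  # d5 (roseola)

print(hamming_distance(f4, f5))  # 2
```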


Properties 

While the SimHash function generates a single-bit output, the MinHash
function generates an integer; SimHash is therefore better compared
with the 1-bit minwise hashing schema, which also uses a single-bit output.
It seems, however, that the MinHash approach outperforms SimHash for
high similarity thresholds [Li10].

In fact, SimHash is a dimensionality reduction technique that maps
high-dimensional vectors to p-bit signatures, where p is small (usually,
32 or 64). As was shown experimentally by Gurmeet Singh Manku,
Arvind Jain, and Anish Das Sarma, 64-bit signatures are enough to
handle 8 billion ($\approx 2^{34}$) documents.

Nearest neighbors search 

The SimHash algorithm lets us represent documents as a space-efficient
SimHashTable of p-bit values that preserve the similarity information,
but there is still a quadratic number of pairs that have to be evaluated to
compute the Hamming distance and compare it with the threshold $\eta$, which
is infeasible to do in a timely manner for huge datasets of millions of
documents.

Additionally, to identify the nearest neighbors for a document d with
signature $f_d$, we need to find all signatures from the SimHashTable that
differ from $f_d$ in at most $\eta$ bit-positions, which is known as the Hamming
distance range query problem 5 and has remained difficult to solve on
a large scale.

For similar documents, meaning a small Hamming distance
threshold $\eta$, we can use the Block-Permuted Hamming Search approach
and split each p-bit SimHash signature into M blocks of about
$b = \lceil p/M \rceil$ consecutive bits each.


Figure 6.1: p-bit SimHash signature split into M blocks 


Instead of comparing the whole signature, we can randomly choose m
out of M blocks and perform search queries using the exact block-by-block
comparison to the top bits of the given signature, where m is
a design parameter related to the Hamming distance threshold $\eta$.

Every group of m selected blocks defines a new shorter signature value
of about $m \cdot b$ bits and, since the order of the selected blocks is not
important, exactly $N = \binom{M}{m}$ transformed m-block signatures can be built
for each original p-bit SimHash signature.


5 M. Minsky and S. Papert. Perceptrons. MIT Press, 1969


Example 6.19: m-blocks SimHash signatures

Consider a 64-bit SimHash signature and define the similarity threshold
in terms of the Hamming distance as $\eta = 2$.

We can split the signature into $M = 5$ blocks, so each block receives about
$b = \lceil 64/5 \rceil = 13$ consecutive bits, for instance, 13, 13, 13, 13, and 12 bits
per block. If we proceed with a block-by-block comparison using $m = 3$
blocks, the total number of ways to choose them is $N = \binom{5}{3} = 10$, and
the resulting 3-block signatures contain either 39 bits or, when the last
block of 12 bits is selected, 38 bits.
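
The numbers from Example 6.19 can be verified with a small Python sketch. The helper `block_sizes` is hypothetical and assumes that the last block simply takes the remaining bits, which is positive for the parameters used here.

```python
import math
from itertools import combinations

def block_sizes(p: int, M: int):
    """Split p bits into M blocks of about ceil(p / M) bits; the last block takes the rest."""
    b = math.ceil(p / M)
    sizes = [b] * (M - 1)
    sizes.append(p - b * (M - 1))
    return sizes

sizes = block_sizes(64, 5)
selections = list(combinations(range(5), 3))  # all ways to choose m = 3 of M = 5 blocks
lengths = sorted({sum(sizes[i] for i in selection) for selection in selections})

print(sizes)            # [13, 13, 13, 13, 12]
print(len(selections))  # 10
print(lengths)          # [38, 39]
```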


Thus, for every p-bit SimHash signature in the SimHashTable we can
produce $N = \binom{M}{m}$ m-block signatures and store them in sorted buckets
$\{B_i\}_{i=1}^{N}$. Each bucket $B_i$ is associated with a particular selection
of m blocks $\pi_i$ and the exact number of bits $b_i$ in the stored signatures.


Algorithm 6.6: Bucketing m-block signatures
Input: SimHash signature table
Input: Number N of m-block signatures
Output: Buckets with m-block signatures
for i ← 1 to N do
    for f_j ∈ SimHashTable do
        f'_j ← π_i(f_j)
        B_i ← B_i ∪ {f'_j}
    B_i ← sort(B_i)
return {B_1, B_2, ..., B_N}


When we need to find near-neighbors whose p-bit SimHash signatures
differ in at most $\eta$ bit-positions from the signature $f_d$ of the given
document d, we probe each of the N buckets, which can be done in parallel.
For every bucket $B_i$ we find all m-block signatures whose $b_i$ bits match
the $b_i$ bits of $\pi_i(f_d)$. If the total number of signatures is $2^q$, then on
average $2^{q - b_i}$ such matches are expected in every bucket.

After that step, to eliminate possible false positive candidates, we
compute the exact Hamming distance for each signature and check that
it doesn’t exceed $\eta$.


Instead of building new shorter signatures and keeping the SimHashTable
for the exact Hamming distance computation in the last step, we can
permute the original SimHash signatures in a way that moves the m selected
blocks to the upper-most bits of the signature and leaves the order of bits
within them untouched. Since the Hamming distance does not change when
signatures are permuted in the same way, we are still able to eliminate
false positive candidates by exact distance calculation.


Algorithm 6.7: Searching for nearest neighbors
Input: Document d = (s_d, w_d)
Input: Hamming distance threshold η
Output: List of nearest neighbors
f_d ← Signature(d, h)
neighbors ← ∅
for i ← 1 to N do
    candidates ← ∅
    for f_j ∈ B_i do
        if π_i(f_d) = f_j then
            candidates ← candidates ∪ {j}
    for j ∈ candidates do
        if HammingDistance(f_j, f_d) ≤ η then
            neighbors ← neighbors ∪ {d_j}
return neighbors
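
Algorithms 6.6 and 6.7 can be combined into the following Python sketch. It is a simplified illustration, not the original implementation: signatures are plain integers, and instead of extracting and concatenating the selected blocks we mask out the unselected ones, which is equivalent for exact matching. All function names and the sample signatures are hypothetical.

```python
from bisect import bisect_left, bisect_right
from itertools import combinations

def block_masks(p, M):
    """Bit masks of M consecutive blocks covering p bits (the last block may be shorter)."""
    b = -(-p // M)  # ceil(p / M)
    return [((1 << min(b, p - i * b)) - 1) << (i * b) for i in range(M)]

def build_buckets(signatures, p, M, m):
    """One sorted bucket per selection of m out of M blocks (cf. Algorithm 6.6)."""
    masks = block_masks(p, M)
    buckets = []
    for selection in combinations(range(M), m):
        mask = 0
        for i in selection:
            mask |= masks[i]
        entries = sorted((f & mask, j) for j, f in enumerate(signatures))
        buckets.append((mask, entries))
    return buckets

def nearest_neighbors(f_d, signatures, buckets, eta):
    """Probe every bucket, then verify the exact Hamming distance (cf. Algorithm 6.7)."""
    candidates = set()
    for mask, entries in buckets:
        key = f_d & mask
        lo = bisect_left(entries, (key, -1))                # first entry with this key
        hi = bisect_right(entries, (key, len(signatures)))  # just past the last one
        candidates.update(j for _, j in entries[lo:hi])
    return sorted(j for j in candidates
                  if bin(signatures[j] ^ f_d).count("1") <= eta)

base = 0x0123456789ABCDEF
signatures = [
    base,
    base ^ 0b11,                           # 2 bits differ
    base ^ (1 << 40),                      # 1 bit differs
    base ^ 0xFFFF,                         # 16 bits differ, but only in two blocks
    base ^ (1 << 0) ^ (1 << 20) ^ (1 << 45),  # 3 bits differ in three blocks
]
buckets = build_buckets(signatures, p=64, M=5, m=3)
print(len(buckets))                                         # 10
print(nearest_neighbors(base, signatures, buckets, eta=2))  # [0, 1, 2]
```

The fourth signature matches the query in one of the buckets and becomes a candidate, but it is rejected by the exact Hamming distance check, which illustrates why the final verification step is needed.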


Using binary search to find matches in each bucket, an individual
probe can be done in $O(b_i)$ steps, but the number of bits $b_i$
should be reasonably large to avoid checking too many signatures.

For every p-bit SimHash signature, the total number of blocks for
the given Hamming distance threshold $\eta$ has to be selected as $M > \eta + 1$;
then for a block-by-block comparison we can use $m \in [1, M - \eta]$ blocks.

However, there is a clear trade-off between the number of blocks m and
the number of buckets N for a fixed choice of the SimHash signature
length p and the Hamming distance threshold $\eta$. If we use more blocks,
and therefore longer signatures, the query time is reduced because there are
fewer possible matches, but the required storage increases. On the other
hand, with shorter signatures we can reduce the storage, but more matches
have to be checked, which increases the query time.


To optimize storage usage, it is possible [Ma07] to compress fingerprints,
which can decrease the data structure size by approximately half.
The compression is based on the fact that fingerprints of similar
documents share a number of bits, so we can build shorter blocks
where fingerprints are encoded by storing Huffman codes for the
most-significant 1-bit positions of their XOR differences.

SimHash has become popular for approximate nearest neighbor
search, which may be partly due to the popularity of the cosine similarity,
for which SimHash can be directly applied. As with MinHash,
the SimHash algorithm suits the MapReduce model and is widely
available, although mostly in independent libraries. Google was reportedly
using it for near-duplicate detection in web crawling.


Conclusion 

In this chapter we considered different approaches to defining similarity
between documents of any nature. We have learned a very powerful
framework that addresses the near-duplicate detection problem, which is
extremely important for many real-world applications. As particular
implementations, we considered two very efficient algorithms that are
widely used in the industry.

If you are interested in more information about the material covered
here or want to read the original papers, please take a look at the list of
references that follows this chapter.

This chapter ends our narration about probabilistic data structures
and algorithms. While it is impossible to cover all the existing amazing
solutions, here we wanted to highlight their common ideas and important
areas of application, including efficient membership querying, counting,
stream mining, and similarity estimation.

Hopefully you found this book useful and learned something from it.

Thank you very much.


